All right, welcome back to the second of our online classes here for 162. Just so you know, we should have closed captioning going today, so you can select that option if you like. Let me remind everybody: keep typing questions into the group chat for now, we'll see how that works. I will try to respond to them and I will certainly repeat them for everyone. So if you remember from last time, we were talking about page tables, and this is a particular structure for a page table that I like to call the magic 10-10-12 pattern for 32 bits. If you notice here, 10 bits give you 1024 entries in the top-level page table, 10 more bits give you 1024 entries in the second-level page table, and that points at 4-kilobyte pages. And what you can see is that we've got 4-byte page table entries, so this all works out: 1024 entries times 4 bytes gives us 4K, so each of these pieces is the same size as a page, which has the rather interesting effect that we can actually page out parts of the page table. The other thing I pointed out is, here's the 32-bit structure that's actually used on an x86 processor. The segment that you happen to be using comes from the instruction in one way or another, and it's used to look up a segment selector, which has in it a 13-bit index into a segment descriptor table plus a table ID. This figure shows the global table. So you grab the segment ID — and whoever is annotating there, please stop — and that segment selector selects a descriptor. That descriptor gives us the base, we add the offset to it, and we get a linear address, and then that linear address goes through the 10-10-12 page table. So this is exactly what's used in modern architectures, at least at the 32-bit level; there's a 64-bit version of this as well. We also started, at the very end of the lecture, to remind you about caching. One of the things I was reminding you about from 61C is average memory access time. If you have a processor that's just talking to slow main memory at 100 nanoseconds, versus a processor that goes through a cache to main memory, then assuming you have any locality at all and are actually able to use the cache, you get a much improved average memory access time. Without the cache, you're essentially going to get 100 nanoseconds, and it'd be nice to do better than that. And if you recall, average memory access time is hit rate times hit time plus miss rate times miss time, and I pointed out that the miss time is the time to pull the data from the lower level into the cache and then access it from the cache. So I gave you two possibilities here, one with a 90% hit rate and another with a 99% hit rate. In the first case, the average memory access time comes out to about 11 nanoseconds, which is significantly less than 100 nanoseconds, of course. But if we have a 99% hit rate, we get down to about 2 nanoseconds, which is even smaller. So it's really all about whether we can get things to actually be in the cache, right? And the reason we were talking about that, of course, is we're talking about this rather expensive process of translating from a virtual address to a physical address.
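Just to make that arithmetic concrete, here is a minimal sketch of the AMAT formula in C, assuming roughly a 1 nanosecond cache hit time and a 100 nanosecond memory access time (those specific constants are assumptions chosen to match the numbers above, not something taken from the slide):

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit_rate * hit_time + miss_rate * miss_time,
 * where miss_time = hit_time + miss_penalty (check the cache, then go below). */
double amat(double hit_rate, double hit_time_ns, double miss_penalty_ns) {
    double miss_rate = 1.0 - hit_rate;
    double miss_time_ns = hit_time_ns + miss_penalty_ns;
    return hit_rate * hit_time_ns + miss_rate * miss_time_ns;
}

int main(void) {
    /* Assumed numbers: ~1 ns cache hit, ~100 ns DRAM access. */
    printf("90%% hit rate: %.1f ns\n", amat(0.90, 1.0, 100.0)); /* ~11 ns */
    printf("99%% hit rate: %.1f ns\n", amat(0.99, 1.0, 100.0)); /* ~2 ns  */
    return 0;
}
```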
Just looking back here, you've got a move instruction to an address, and you have to look up in a segment descriptor table and then in a couple of levels of page table and so on. Boy, that's going to be really slow unless we can somehow cache it. And that's kind of where we were picking up last time. If you remember, again, this is the more extended memory hierarchy. The idea here is we have big things on the right and small things on the left. The big things are slow because they're big — that's physics. The small things are fast, again because of physics. But if we do caching well, we get access times that look like the speed of the things on the left with the size of the things on the right, and we can approximate that if we have good locality. OK, so the new thing we've added in all of this is that the page tables are actually in memory. So they might be on disk, or in main memory, or even in some level of the caches. And we're going to put the TLB up here near the registers, running at close to register speed, in order to get things to be faster. So how do we make address translation fast? Well, the processor sends reads through the memory management unit, they get translated, sent to the cache, and then down to memory. And wouldn't it be great if, once we went to all the trouble of computing one of these translations, where a virtual page becomes a physical page, we didn't have to pull it out of the page table in physical memory every time? We cache it somehow near the MMU, and then the next time we look for it, it's fast. All right, so that's our goal for today. Were there any questions on what we covered last time? I wanted to pause for a second — please use the group chat if anybody has something to ask. OK, so what we're going to do now is remind you a bit about how caches can be structured, and then try to figure out how to structure this particular type of cache called the TLB, the translation look-aside buffer. So let's remind ourselves a little — this was at the very end of the lecture. The TLB itself is a cache of translations. If a translation is already present in the TLB, then you grab the physical address without ever touching the page table. And so this is how we're going to avoid going all the way down to DRAM every time we translate. The idea of a TLB was actually invented by Maurice Wilkes prior to caches, amusingly enough. And then people decided that if it's good for page tables, why not for the rest of data and memory? And as a result, the cache was born. The other thing is, when you invent something, as I mentioned at the very end of last lecture, you're allowed to name it. So it's called a translation look-aside buffer, and we'll continue to call it that, because Maurice got to call it that. I guess if he'd called it Fred, as I mentioned last time, we'd be talking about how Fred was speeding up our translations. But the key is going to be that on a TLB miss, the page tables might even be in the cache, and so the access may be somewhat faster than going all the way down to DRAM. So whatever we come up with, we'd like to utilize all of the caching that's available to us. OK, so how do we apply this to address translation? Here's another figure that may help. The CPU is generating addresses that are virtual.
We check whether they're cached in the TLB, and if they are, we go directly to physical memory and look things up. So the TLB effectively speeds up our access — we're not going all the way down to memory for the translation. Of course, if we're unlucky, we have to go through the MMU and do the full translation, and then we save the result in the TLB and go forward. OK, and of course we also talked about untranslated reads or writes that might be available to the kernel. Now, the question you might want to ask yourself, since you all remember 61C very clearly, I'm sure, is: is there locality in this address lookup? Because if there's no locality, then there's no hope of getting caching to work. Well, what about instructions? You can imagine that with loops, we tend to use the same instruction pages over and over again, so yeah, there's definitely locality in the instruction stream. The stack has definite locality because you're pushing and popping on the stack, so you have a tendency to use the same pages over and over again. And data accesses have less page locality, but there's still some. You could imagine walking across gigabytes of storage if you were going through a huge array, and that would have no locality — you'd spend all your time missing in the TLB — but many other access patterns are much better behaved on the TLB. So the question here is, just to clarify: if a translation is not in the TLB, then the page table references may be cached? Yes, so the page table itself can be in the L3, L2, or L1 cache. And so it's possible that when we're walking through the page table, we're also not going all the way down to DRAM in order to satisfy that TLB miss. Did that answer the question, Andrew? Great. So let's continue. And just like with regular caches, we could ask ourselves whether we could have a first-level, second-level cache hierarchy for the TLB, and the answer is certainly yes, because it's a cache structure just like before. So now moving forward: what kind of a cache is the TLB? All right, so if you remember from 61C, we've got a lot of parameters. We've got the line size, which is how big each cache entry is. We've got the set size, which is the associativity. We've got the total data size, the number of sets. So there are a lot of parameters here. And just for your edification, so we're all on the same page, let's remember a little bit about caches, and then maybe that'll help us figure out what our organization is supposed to be. So they probably taught you about the three C's in 61C. I'm actually going to talk about the four C's, or the three C's plus one, as I sometimes do. The first three actually came from the PhD thesis of Mark Hill, who was here at Berkeley at the time — so once again, this is a Berkeley artifact. Compulsory misses, first and foremost, are when you access something in the cache that you've never accessed before. In theory, there's really no way for the cache, other than predicting the future, to know that you needed that in the cache. That's also called a cold miss, and you typically have to go down to the next level of the cache, or all the way to DRAM, to fill a compulsory miss. The one exception is if you have a good prefetcher that recognizes a pattern of use; then it can potentially pre-fill the cache. Capacity misses are cases where the cache is just too small.
So you've loaded a bunch of things into the cache, and by the time you get back to use them, the thing you put in there is gone. And the reason it's gone is that the cache was just too small. That's a capacity miss, and the only way to really fix capacity misses is by making the cache bigger. A slightly more subtle type of miss is called a conflict miss. A conflict miss — and I'm going to remind you about this in a moment — is when the associativity of the cache is a little too small, and multiple cache lines overlap in the same physical location. So we loaded something into the cache, we subsequently loaded something else that kicked it out because of a conflict — not because of the cache size — and then when we went back, it was missing. OK. And last, which is the plus one, is what's called a coherence miss. A coherence miss is an invalidation that can show up if you have multiple cores, for instance in a multicore processor. That's a case where one core might have loaded a value read-only, and if nothing else happened it would cache really well, except that another core did a write. To keep those two locations coherent, typically that write causes an invalidation, which invalidates the first core's copy, and so by the time the first core goes back to look, it has to reload it into the cache. All right. Does this sound roughly familiar with what you remember? Any questions on this? Now, one question that sometimes comes up — it wasn't asked here — is: how do you tell the difference between a capacity miss and a conflict miss? I basically gave you the description: a capacity miss is due to the size, and a conflict miss is due to too little associativity. But if you were running a simulation against a cache, how would you tell the difference? The answer is, if you have a series of accesses and you're going to run them against a sample cache to see whether it's the right cache for you — everybody needs the right cache in their life, right? — what you'd do is run them against a cache just like the one you're testing, but fully associative. Then the only reason you'd ever kick something out is because the cache is too small. You'd find out what all the misses were: those are the capacity misses. And then you'd rerun it with your actual cache, and anything new would be a conflict miss. OK. So the question here is: is associativity the same as the amount of neighboring memory we bring in? No — that's a type of spatial locality you're referring to. Hopefully in the next few slides we'll have this straight, but if it's still not clear, ask again and that would be great. So let's look at what associativity is. And before we do that, I want to look at what a cache block is. A block is the minimum thing we can pull into a cache, all right? Whenever you get a cache miss, you go to the next lower level and you pull in a block. That block could easily be 32 bytes on an early processor, or 128 bytes on some of the more modern ones. And the address that you're going to access is divided into a couple of pieces. The lower set of bits decides which byte within the block you're interested in. So for a 32-byte block, this is going to be five bits, because the log base 2 of 32 is five — one of those things to put on your list of things to learn.
Then the block address is everything that's not the byte select. So if the byte select were five bits down here, the block address would be the remaining 27 bits on a 32-bit processor. Some of it is called the index, which decides what set we're in, and the rest is called the tag, which helps us decide whether something's in the cache or not. And don't worry, I'm going to remind you what those things are. The index is used to look up candidate lines in the cache, and the tag is used to decide whether any of those candidates is actually the block you want. And if nothing in the cache matches this block address, at that point you get a cache miss and you've got to go to the next lower level. So just to put this in perspective, a direct-mapped cache is a particularly simple cache, which hopefully is triggering memories for you guys. A direct-mapped cache has associativity 1, so there's essentially no associativity here. We're going to talk about a 2-to-the-N byte cache where the uppermost 32 minus N bits are the cache tag and the lowest M bits are the byte select, where the block size is 2 to the M. And so if you look, here's our cache down in the lower right corner, and it's made of 32-byte cache blocks. We computer scientists always start counting at 0, so each of these lines is a different cache block: cache block 0 has 32 bytes, cache block 1 has 32 bytes, cache block 2 has 32 bytes, et cetera. Each one of those cache blocks in the cache has its own tag and a valid bit. So when we try to find out whether the thing we're looking for is cached, this is the point at which we check the tag, and if the tag matches, we start accessing our data. So here's our address: it's 32 bits, and I've divided it up into the byte select, which is the lower five bits — that's going to tell us which of the bytes we want in the end. The next five bits in this particular cache tell us which line, or cache block, we're interested in; it's five bits because we have blocks 0 through 31 here. And then the rest is the tag. So here's what comes out of, say, the processor when it wants to look something up. The first thing we do is use the index to look up, in our static RAM, the cache block we're interested in. And notice that in this direct-mapped cache it's very simple: you just take the index. Let's say it's got a 1 in it, so that would be 0, 0, 0, 0, 1 — that gets us this block, and now this is the cache block we're interested in. How do we know if it's the right block? Well, first of all, does the tag field of the address match the tag stored in the cache? And also — I forget whether I have this on the animation — is the valid bit set? Either a mismatch here or an invalid bit tells us that we don't have what we want. Then, assuming the answer is yes, we use the byte select field to pick the byte of interest. So here the lower five bits are all 0s, and so we want the lowest byte in the block. And so this was a successful hit, because we're assuming the tag here matches, say, 0x50: the cache index gave us this line, and the byte select, because we had a match, gives us that first byte. So I'm going to pause here now and see if there are any questions on the direct-mapped cache, because this is our simplest — and it turns out the fastest cache we can build is a direct-mapped one. Anybody have anything they're curious about?
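While we pause, here is a minimal sketch in C of that direct-mapped lookup — the 1 KB cache with 32-byte blocks from the slide, so a 5-bit byte select, a 5-bit index, and a 22-bit tag. The structure and function names are made up just for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the 1 KB direct-mapped cache from the slide:
 * 32 blocks of 32 bytes, so a 32-bit address splits into
 * [ 22-bit tag | 5-bit index | 5-bit byte select ].       */
#define BLOCK_BYTES 32
#define NUM_BLOCKS  32
#define OFFSET_BITS 5          /* log2(32) */
#define INDEX_BITS  5          /* log2(32) */

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

static struct cache_line cache[NUM_BLOCKS];

/* Returns true on a hit and copies out the byte; a miss would go to the
 * next level, fill the whole 32-byte block, and set the valid bit and tag. */
bool cache_read_byte(uint32_t addr, uint8_t *out) {
    uint32_t byte_sel = addr & (BLOCK_BYTES - 1);
    uint32_t index    = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag      = addr >> (OFFSET_BITS + INDEX_BITS);

    struct cache_line *line = &cache[index];
    if (line->valid && line->tag == tag) {   /* valid bit AND tag match */
        *out = line->data[byte_sel];
        return true;                         /* cache hit */
    }
    return false;                            /* cache miss: go to the next level */
}
```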
Trying to remember why this works the way it does — OK. Nope. OK, good, I'm going to assume we're good to move on. OK, here's a question: do more bytes in each block mean higher spatial locality? Yes. So if instead of 32 bytes these blocks were 64 or 128 bytes, the reason that gives higher spatial locality is that when there's a miss, we pull more bytes in from DRAM. That means the byte select here, when it's all zeroes, gives us the first byte of the block, 0, 0, 0, 0, 1 gives us the next byte, and so on — the more bytes we load at once, the more spatial locality we can exploit. So yes, that should answer your question, and that's why modern processors tend to have something more like 128 bytes in their cache blocks. And then down at the lower end here — yeah, M is 5. Very good, that was the question. Oh, excuse me, I take that back: you're asking what N is. In this case N is 10 and M is 5, because the cache is 2 to the 10th bytes. That, I think, is what you wanted, Daniel: the size of the cache block itself is defined by the fact that M is 5. Got it? Good. And notice, by the way, that the tag is the part of the address that doesn't directly index into the cache: the index is an index into the cache itself, the byte select pulls something out of a cache line, and the tag is the part left over. Now, what I wanted to point out very quickly is this issue here that's kind of interesting. There are a bunch of addresses for which the cache tag is different but the lower 10 bits are the same. After all, this is 10 bits down here, so we've got 22 bits up here: there are about 4 million addresses that all share the same lower 10 bits, and as a result all 4 million of those would try to map into the same cache line. So if we went for a different address that had the same index, with a 1 here in these 5 bits, we're going to kick this one out. And if we go back to look at the first one again, that would be a great example of a conflict miss: something that matched in the lower 10 bits kicked this one out, and we have to refetch it later when we go back to it. So how do we improve on this? Now we have associativity — and this is getting back to that question from a little earlier: what is associativity about? What we're going to do is potentially have an N-way set associative cache where there are N options, and the N options are all looked up by the same index. Here I'm going to have N equal to 2. What that means is that unlike the previous cache, where the cache index gives us exactly one candidate block, in the case of associativity 2 — hold that question for just a second, Andrew — we've got two banks of cache that we can look up. And so the first thing we do is use the index. And, to the question: the answer is no, not every capacity miss is a conflict miss. If you look here, when we have our cache index, it pulls two different candidate cache lines, and then we check the tag against both of them. If one of them matches, the comparator gives us a bit that's high, and at that point we select the data from that cache block.
And then, last but not least, we use our byte select to go after the byte we want. So I think the way to understand the difference between capacity and conflict misses, again, is literally: imagine there are no possible conflicts — you start with a fully associative cache, you find out what all the misses are there, and then you drop your associativity down, and the new misses are your conflict misses. The one flaw with that, which I think gets at that question, is that sometimes when you have a cache with less than full associativity, which I'll show you in a second, some misses that would have occurred in a fully associative cache end up not occurring. That's a little counterintuitive, but it does happen. So hopefully that answers your question there, Andrew. So why is this interesting? What's good about this is that now we have two lines with the same lower bits that can both hold data. So we'd have to conflict with two different lines before we kicked anything out, and it's less likely to have conflicts when you increase your associativity. And as you increase your associativity but keep the cache the same size, what actually happens is that your cache index shrinks and you have more and more banks; together they add up to the same cache size as before, but you potentially look up more candidates. So we could take this direct-mapped cache — pick a size, say one kilobyte — and turn that one-kilobyte cache into two 512-byte banks: still a one-kilobyte cache, but two-way set associative. And then, last but not least, we can do this trick where there's no index at all, because the cache is fully associative. What happens here is we compare the cache tags of everything: every line has a tag, we find the one that matches, and once we've got a match, we pick our byte out of that line. And if you notice, there's no chance of any conflicts here, because any block can be put in any slot, so there's no conflict. Now, I'm sure several of you out there are thinking: well, why don't we do this all the time? Why isn't it always fully associative? Can anybody guess what the answer is? What's the story on associativity — anybody remember from 61C? Yeah: it's slower, higher overhead, the hit time is longer, it's more expensive. Those are all good answers. Now, what's tricky about this is: why is the hit time longer? If you look at this fully associative cache, you might say, well, I'm doing all the comparisons in parallel, so that's fast — why should this be the slower option? And the answer is size. This is big in transistors, so everything's further apart, and once you get a match, the signals have to travel further. So physics means this is slower. And it's not only the physics of the size of the transistors: if you look here, there's this MUX, and as we go closer and closer to fully associative, this MUX has many more inputs to pick the right output, and that's also going to be a big structure that's slow. So as the associativity goes up, things get slower, and you've got to make that trade-off. So the question is, what do I mean by the signals having to go further? What I mean is that if you have a circuit with a small number of transistors, it's going to be a small area; if it has a lot of transistors, it's going to be a bigger area, because transistors take space. And bigger means that a signal going from one part of the chip to another has to go farther, and so it just takes longer. So that's a physics issue. Okay.
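To make the structure concrete, here is a rough sketch of an N-way set-associative lookup over the same 1 KB cache with 32-byte blocks. The names and constants are made up for illustration, and a real cache does all the tag comparisons in parallel in hardware rather than in a loop:

```c
#include <stdint.h>
#include <stdbool.h>

/* Same 1 KB cache with 32-byte blocks, now N-way set associative.
 * With ASSOC = 1 this degenerates to the direct-mapped case; with
 * ASSOC = NUM_BLOCKS there is one set and the cache is fully associative. */
#define BLOCK_BYTES 32
#define NUM_BLOCKS  32
#define ASSOC       2
#define NUM_SETS    (NUM_BLOCKS / ASSOC)      /* 16 sets of 2 ways */
#define OFFSET_BITS 5                         /* log2(32) */

struct way {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

static struct way sets[NUM_SETS][ASSOC];

bool set_assoc_read_byte(uint32_t addr, uint8_t *out) {
    uint32_t byte_sel = addr & (BLOCK_BYTES - 1);
    uint32_t set      = (addr >> OFFSET_BITS) % NUM_SETS;
    uint32_t tag      = addr >> OFFSET_BITS;  /* everything above the offset;
                                                 the set bits are redundant in
                                                 the tag but harmless in a sketch */

    /* Hardware compares all ASSOC tags in parallel; here we just loop. */
    for (int w = 0; w < ASSOC; w++) {
        struct way *line = &sets[set][w];
        if (line->valid && line->tag == tag) {
            *out = line->data[byte_sel];
            return true;                      /* hit in one of the ways */
        }
    }
    return false;                             /* miss: pick a victim way, e.g.
                                                 LRU or random, and refill it */
}
```

With ASSOC set to 1 this is exactly the direct-mapped lookup from before, and with ASSOC equal to the number of blocks it is fully associative; the associativity knob just trades more parallel tag comparisons for fewer conflicts.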
The other question was: why is the index placed in the middle of the address, as opposed to using the upper bits up here? Anybody have a thought on that? Yeah — this gives us fewer conflicts, and that's a good answer. The reason it gives fewer conflicts is that it means a set of similar addresses that are close to each other can all fit in the cache at once. If we put the index up in the high bits, then addresses that are near each other would all compete for the same few lines, and for a given chunk of the memory space we'd effectively have a much smaller cache, and things would be much slower. Okay, so let's see, I think that's good. Notice, by the way, that in all of these cases I've been giving you, the cache line is always 32 bytes and we're assuming a one-K total size; all we're doing is rearranging the cache into pieces. And assuming we get to that place later in the lecture, this will be interesting when we talk about ways of overlapping TLB lookup and cache access at the same time. Okay, so a little administrivia — I hope you all know this is very important: Saturday is Pi Day. You can maybe go out and get some good banana cream pie as well, but Saturday, March 14th, is Pi Day. And I thought I'd tell you a couple of things about pi, because pi is one of my favorite numbers in the world. One thing that's very interesting is that 40 digits of pi are sufficient to calculate the circumference of the visible universe down to atomic dimensions, all right? Is there a good place to get pie in Berkeley? Many good places. So, the 40 digits, which I show you right here and which you should all know: 3.1415926535897932384626433832795028841971. And then of course it goes on, 69399375105820974944, et cetera — by the way, sorry, closed captioner, you don't have to put that in there. But the thing that's interesting is that even though 40 digits are more than enough for any physical application, why do we like pi? Well, the notion that it's an irrational number that goes on forever has fascinated people for a long time — people wonder, are there any patterns in it? There was even a science fiction book by Carl Sagan where, if you went out far enough, there was actually a pattern of ones and zeros that made a picture in pi. But anyway, the other thing I wanted to point out is that the best formula for pi is this very cool one from Ramanujan, where one over pi is equal to — I won't even bother — this incredible summation. And last year on Pi Day, Google announced that one of its employees, Emma Haruka Iwao, calculated pi to, amusingly enough, 31,415,926,535,897 digits. So that was a new record. And yes, you guys can all round pi to three, although 3.14 is pretty good. So, some real administrivia now. Don't forget your peer evaluations, okay? It's very important that you fill them out, and it's as important as getting to know your TA. I'm assuming most of you have done this, but we have a few stragglers, and it's very easy to do; we've talked a lot about it before, so make sure to do it. Project two design docs were due yesterday, as you may know.
And the thing I wanted to point out is, since the design reviews are an oral exam, they are still mandatory unless you're sick or there's a good reason. You need to make sure you go to your Zoom call with your TA for the design reviews and find a time you can all attend. Ideally, if you've got a system with a camera on it, that would be great as well, because your TA would like that — it's easier for people to talk in a small group. So please plan on attending those; they're mandatory, virtually of course. And I know I saw a bunch of questions on Piazza earlier today about not having the Zoom links from your TA. I apologize — I think we ended up posting a bunch of them, and if you still can't get them, keep reaching out to your TA; they should be available soon. So you've probably all talked with your friends and neighbors about the widely varying types of midterms that people have tried over the last week. There's been a huge amount of discussion on the faculty lists about how to do exams. Technically speaking, being in school in person is only suspended until after spring break; I'm not entirely sure what's going to happen. I did see somebody say that UCLA is giving A's to everyone — well, we'll take that under advisement. I'm not sure what's going to happen, but you should all study as if the exam is happening in some form; it'll probably be more likely to be virtual than not. But material up to lecture 18, or at least some material from it, is certainly fair game. Along these lines, you should also try to go to your discussion sections and see what's going on there, because that will help you with some of the material as well. All right, are there any questions on this? Is the exam cumulative? That's a good question. Technically, everything we do is cumulative, because we can't easily write questions that assume you forgot everything from the beginning. We try our hardest to focus directly on the material that's new, but we assume you still know what was going on from the previous midterm. There's a good question about tips for meeting with project groups now that being in person is generally harder — and let me also answer the question about the final, which is similar to what I just said for midterm two: there is no final, there's only midterms one, two, and three, and those will be focused primarily on the new material. But anyway, tips on meeting with project groups: I think you should try to make your virtual meetings must-see TV, so to speak — just pick a time when everybody's going to be together and meet up. Use Zoom, yes; if you've got Zoom or something else — I think Google Hangouts works pretty well for small groups, and everybody should have access to Google Hangouts — whatever works. This is enough of an unprecedented situation for everybody pretty much everywhere on the planet right now, so whatever works for you guys, let's do it. And what would be great is maybe we'll start a Piazza post where people can share things that work for keeping groups together during virtual classes. Anyway, I don't have anything else to suggest on that front — use every tool you can come up with, and just be safe and wash your hands frequently. Remember, you sing happy birthday twice; I suppose you could recite pi to the last digit twice while you're washing your hands, and that would probably be long enough as well.
And so let's all try to stay safe, and I'll keep you up to date on Piazza as to what's new. I think what's good is that at least Berkeley didn't give you guys five days to move out of the dorms like Harvard and a couple of other places did. Anyway, any other administrivia questions? I realize this is tough, everybody. Let's see where we're going — I'll try to keep the material interesting and keep it flowing. I may have to come up with some interesting items from the world of computers for every lecture; maybe I'll figure something out to try to keep the interest up. But in your group meetings, try to keep the interest up as well. Okay, should we go on? Professor Weaver style? Sure, okay, we will possibly do something like that. Okay, here we go. Oh, you want to hear about the priority inversion bug? Yes, that sounds good — not today, but we'll talk about that. Thank you for reminding me. Okay, so moving back onto the material: where does a block get placed in the cache? I wanted to show you direct-mapped, two-way set associative, and fully associative, just to give you an idea of where things might go. So here's an address space with 32 blocks in it. If your cache has only got eight blocks and it's direct mapped, the way you decide where a block goes — for instance, block 12 — is 12 mod 8, where 8 is the number of cache lines, and that tells you where it goes. So basically every eighth block is going to conflict in a direct-mapped cache. In a two-way set associative cache, 12 mod 4 tells you which set it lands in, and it can go in either of the two places in that set — that's for the same eight-block cache. And then, last but not least, with fully associative we still have eight blocks, but any block can go in any line. All right, so are we good on that? Alrighty. And by the way, that modulus arithmetic corresponds to the parts of the address at the bottom that hold the index and the byte select. So where should a block be placed on a miss? For direct mapped it's very easy — there's only one possibility. But once we start thinking about it, if you notice this fully associative example, what's interesting is that any new cache miss has eight places it could go. So how do you decide where to put it? In the direct-mapped case you only have one place, so that's easy; fully associative, not so clear. So one option is least recently used: you pick the line that was least recently used, and that's the one you throw out. Or you could do random selection. What's interesting is that if you look at the miss rates for a workload, LRU versus random really doesn't make much of a difference once the cache is big enough. The difference between these two options, LRU and random, isn't very much when you have a large cache; that's because the miss rates are low, and if you randomly pick a victim, there's enough space in the cache that you're unlikely to create a lot of conflicts. What's interesting about this comparison is that it depends a lot on how expensive it is to miss in the cache. So, thinking about a TLB for a moment: is the miss penalty high or low? What do you think? Okay, I have a plea for "high".
The answer is yes, because you've got to go walk the page table when you miss in the TLB. So a TLB miss is expensive enough that random replacement starts to look questionable, and you could imagine wanting to do some sort of LRU for the TLB, especially since TLBs are small. And with small TLBs, because they're in your pipeline, the smaller you get, the more of an advantage there is: LRU has a slightly lower miss rate here, so there is some advantage to LRU. So, the question is whether computing a random index — just the index of which entry to replace — incurs some non-negligible time overhead. The answer is pretty much no: for randomness, all you have to have is a counter that's spinning really rapidly; you freeze it at the moment you're doing the replacement and you've got a random number, so there really isn't any overhead there. LRU is more interesting, and if you're ever curious, maybe when I get my office hours going again I can tell you how you can actually implement LRU in hardware — it's pretty interesting, but it gets expensive pretty quickly. So anyway, the bottom line is that as you get bigger caches, randomly picking a victim might be okay, unless your miss penalty is high. What happens on a write? And: is MRU ever used in industry? Pretty much not — I haven't seen MRU used on caches per se. The problem with it is that you're likely to reuse something: if you have any temporal locality, kicking out the most recently used thing is almost guaranteed to cause a miss. MRU is something you might use in more of a FIFO-type situation — there are some uses for that, but not typically for caching. So what do you do on a write? There are two options: write-through or write-back. Write-through says that you write both the cached copy and the underlying memory. Write-back says you put your data only in the cache, and it's only when you replace the block that you write it back. Looking at this picture for a moment, what it means is that with write-through, if this item is cached and I write to it, I also write to the underlying DRAM right away. With write-back, I put it only in the cache, the memory becomes out of date, and it's only when I replace the block that I write it back, okay? So when do you want one or the other? The pros of write-through are that on a read miss you never have dirty data to deal with, so you can throw lines out quickly; but the processor is held up on writes unless the writes are buffered, and that can slow you down a lot. Write-through tends to be used in the upper levels of the cache hierarchy, where you don't have to go off the chip. Write-back means that repeated writes are not sent to the DRAM over and over again, but the cons are that it's more complex: you might have to write dirty data back when you have a miss. And if you think about it a little — we haven't gotten to how we actually use virtual memory to page yet — you can imagine that if virtual memory is thought of as a cache on top of the disk, we absolutely don't want to do write-through, because that would mean every write by the processor would have to go to the disk, which is going to be bad. But what it does say is that when we write to pages in memory, whenever we replace a page we first have to see whether it's dirty and write it back to the disk before we get rid of it. We'll come back to that when we get into page replacement algorithms, a little toward the end of this lecture and next time.
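Here is a minimal sketch of the two write policies with a dirty bit, against a pretend DRAM array — everything here (the array, the struct, the function names) is made up for illustration:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_BYTES 32

static uint8_t dram[1 << 20];          /* pretend 1 MB of main memory */

struct line {
    bool     valid;
    bool     dirty;                    /* only meaningful for write-back */
    uint32_t block_addr;               /* address of the cached block    */
    uint8_t  data[BLOCK_BYTES];
};

/* Write-through: update the cache AND the memory below on every store.
 * Read misses never have dirty data to deal with, so eviction is cheap. */
void write_through_store(struct line *l, uint32_t addr, uint8_t value) {
    l->data[addr % BLOCK_BYTES] = value;
    dram[addr] = value;                /* memory stays up to date, always */
}

/* Write-back: update only the cache and mark the line dirty; memory is
 * brought up to date later, when the line is evicted to make room.      */
void write_back_store(struct line *l, uint32_t addr, uint8_t value) {
    l->data[addr % BLOCK_BYTES] = value;
    l->dirty = true;                   /* the DRAM copy is now stale      */
}

void evict(struct line *l) {
    if (l->valid && l->dirty)          /* dirty? write the block back first */
        memcpy(&dram[l->block_addr], l->data, BLOCK_BYTES);
    l->valid = false;
    l->dirty = false;
}
```

The evict path is exactly the thing that comes back with paging: a dirty page has to be written back to disk before its frame can be reused.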
Okay, so any questions? That's your very quick reminder of what caches are from 61C; I'm going to assume that went well. Okay, the question is: how is data consistency handled if write-back is implemented in a multicore system? That's a great question, and the answer is that there are many ways. The simplest way is if you actually have a bus on the chip: then every cache can watch the bus, and if it notices a write to a line that it has already written itself, everything gets held up while the first write goes back to memory before the second write is allowed to proceed. That's with a bus, and it doesn't scale too well. So what typically happens, both to handle dirty data and to eliminate reads that would be out of date, is some sort of directory-based cache coherence protocol, which I'll talk a little bit about later in the term. But there are several ways of keeping consistency, and it's basically either broadcasting what's going on so everybody can keep their copies up to date, or keeping a directory that has the information about who's got dirty data and who's got read-only copies that need to be invalidated. Hopefully that helps. All right. So what's the impact of caches on operating systems? Many things, okay? As I mentioned last time, pretty much everything in an operating system is a cache of some sort, so one of the things the operating system needs to do is deal with cache effects — for instance, maintaining the correctness of various caches. One thing that typically isn't invalidated automatically, unlike the hardware caches I just mentioned, is the TLB. If you change the page table, then oftentimes you've got to go hunt down the TLB entries that are now incorrectly caching a translation and invalidate them. That's called TLB consistency, and it's often handled in software. Another thing is that the very existence of caches means that when you schedule processes on a multicore, you may want to schedule a process back onto the same core it was running on before, to take advantage of the state already in that core's caches. If processes have really large memory footprints, it's possible that if you switch too quickly — this is related to a question on the last midterm — you schedule something, it runs for a little while, then you schedule something else, the first one's state gets kicked out, you schedule it back, and because you're switching so quickly you never get the advantage of loading up the cache with all the state required to get good performance out of the process. And this isn't just about processes; it also applies to deciding how to schedule threads. If you interleave too rapidly, you may degrade cache performance — and degraded cache performance, by the way, means more misses, and any time you increase the number of misses, the average memory access time goes up, as we discussed. And then one of the other things is that you often need to design operating system data structures so that they don't degrade cache performance.
And a great example of that is something called false sharing: if you have a data structure that crosses a cache line and it's used by multiple processors, then even though the two of them are reading and writing logically separate parts of that data structure, it may bounce around between processors; or if you have a page that's only partially used by one processor or the other, it can get bounced around. So you really have to use your knowledge of what the cache looks like to keep cache lines that are mostly being used on one processor from being interfered with by something happening on another processor that may be essentially unrelated. This is an important part of design, but we're not going to worry about it in detail right now. Right now what I'd like to talk about is what TLB organization makes sense. If you look at the way we've been describing things, the CPU produces virtual addresses, which go through the TLB, which then goes to the cache, and then to memory. And what you can see from this diagram is that memory is DRAM and slow, and you've done a huge amount of work to make sure things are in the cache so the CPU can do fast loads and stores — so if you're not getting through the TLB really rapidly, you've slowed everything down again. You'd like the TLB to not get in the way of very fast access to the cache. And by the way, I want to say something more here, because I realize I said two slightly different things and I want to correct that. False sharing is specifically the case where the cache line is a bit bigger than your data, and you have two little data structures that sit on the same cache line — I'm sorry I misspoke there. What happens is that if processor A is using one part of the line and processor B is using the other, the cache coherence protocol can bounce that cache line back and forth between them even though the two processors aren't sharing that data at all. So you have to be very careful about how things align on cache lines. That's what's called false sharing: the coherence system thinks there's sharing, but the two processors aren't actually sharing information, and everything gets slower.
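Here is a tiny illustration of what that alignment issue can look like in C — two per-thread counters that are logically independent but land on the same line, versus padding them onto separate lines. The 64-byte line size and the struct names are assumptions made just for this example:

```c
#include <stdint.h>

/* Two threads each hammer on their own counter. The counters are logically
 * independent, but in this layout they sit on the same cache line, so the
 * coherence protocol bounces the line between the two cores on every write
 * -- false sharing. (A 64-byte line is assumed here, typical for x86.)     */
struct counters_false_sharing {
    volatile uint64_t thread_a_count;   /* same 64-byte line ...  */
    volatile uint64_t thread_b_count;   /* ... as this one        */
};

/* Padding each counter out to its own cache line removes the bouncing:
 * the two cores no longer invalidate each other and both run at full speed. */
struct counters_padded {
    volatile uint64_t thread_a_count;
    char pad_a[64 - sizeof(uint64_t)];  /* push the next field to a new line */
    volatile uint64_t thread_b_count;
    char pad_b[64 - sizeof(uint64_t)];
};
```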
All right, so let's talk about this TLB now. How do we make it really fast? Because it's got to be something like half of a cache access, or even faster than that — it's in the critical path of memory access. So how the heck do we make this fast? It's basically adding to the access time and reducing cache speed, and in some cases, if you do this poorly, the whole cycle time of the system gets slower and therefore your CPU is just plain slower. That kind of argues that the TLB ought to be direct mapped, or have very low associativity, to keep it fast. However, you also need to have very few conflicts, so that you're not walking the page table all the time, and that tends to argue for high associativity, or even fully associative. So what are you going to do? Pretty much, that argues the TLB ought to have as high an associativity as you can, because conflict misses are expensive; but to keep it fast, you make it small. So if you're wondering why TLBs are often very small but highly associative, it's really because they need to be fast and they need to avoid conflict misses — and those two things go against each other. And because the TLB tends to be a bit small, you can end up with thrashing situations, especially with modern operating systems, which like to have many address spaces, one for each process. We briefly mentioned microkernels at the beginning of the class: in that situation the file system has a separate address space from the network stack, from the windowing system, et cetera, because those are all running at user level. In that case, if your TLB is too small, you can actually get thrashing, where as you switch rapidly from process to process you keep kicking things out of the TLB. So we've got to be careful. You also kind of want at least three-way associativity here, so you can have TLB entries for the code, the data, and the stack at the same time. And, going back to our earlier question: what if the high-order bits were the index? Well, what happens there is that the TLB ends up being mostly unused for small programs, because the high bits of the address don't change much in a typical program. So putting the index bits at the high end basically means you waste whole chunks of your TLB; we want the index bits down low, where they are in our previous diagrams. So how big does a TLB actually have to be? It's usually small: 128 to 512 entries. It's getting a little larger, but it's still pretty small, and that way we can afford higher associativity. They're usually organized, as I said, as a fully associative cache, where the lookup is by virtual address and it returns a physical address — I'll show you that in a few animations to help make it clear. When fully associative is too slow, you can put a very small, fast, direct-mapped or low-associativity TLB in front of the bigger one; that's called a TLB slice, and it's like having a second-level cache for translations. And here's an example, the MIPS R3000. That's an old five-stage pipeline, kind of like the ones they're returning to these days in 61C and so on. I bring it up just to show you: each entry in the TLB has the virtual page it's mapping, the physical page it maps to, and a bunch of access bits, including an address space ID, which we're not going to talk about right now. You can imagine what happens: the virtual address comes in, we extract the virtual page number, we fully associatively match it against the entries — say this one matches — and what comes out is a physical page, which is then combined with the offset. Okay. So here's an example of that pipeline. The reason I went back to this one is that it's from the days of simpler pipelines, when we could explicitly manage our transistors much better. And if you notice, we need a translation both for instruction fetch and for data access — remember the five stages from 61C: instruction fetch, decode or register lookup, ALU, memory, and then write-back to the registers. Those are the major cycles, and then we actually have half cycles. So what happened in this processor was that for instruction fetch, the TLB would be looked up in the first half cycle, and then we'd have a whole cycle for looking up the instruction.
And at the end of that whole cycle — which overlaps the end of the instruction fetch stage and the beginning of decode — we'd look up the register file, which often took just half a cycle. Then the ALU, if it was computing a memory address, would try to do that in the first half of its cycle, so that we could do the TLB lookup in the second half, and then we'd have our physical address ready for the memory stage. So this is a situation where the TLB is literally in the direct path of the caches, and so the TLB has to be very restricted in how long it takes; otherwise we're going to have a poor cache cycle time. Are there any questions on that? This is the simplest example of a TLB integrated into a pipeline; I just wanted everybody to see one. Are we good? Okay. All right, so then, how do we reduce the translation time even further? What we've been describing is the CPU putting the TLB in series with the cache: we pull the virtual page number out of the address, look it up in the TLB, get the physical page number, merge in the offset, and that's our physical address. But wouldn't it be great if we could overlap the TLB lookup and the cache lookup? So I'm going to show you a little magic now, because on the face of it that should sound a little strange: what do we look up in the cache? We look up this thing, the physical address — in this particular cache we're talking about. So how can we overlap? Well, if you look at a virtual address, and you look at the physical address from the standpoint of the cache, there's an index and a byte select down at the bottom, and only those parts of the address are needed to start the cache lookup; the rest is just the tag. So if we can arrange things so that the cache index and byte select fit entirely within the page offset — the bits that don't get translated — then we can be doing the cache fetch at the same time that we're doing the TLB lookup on the virtual page number. And as a result we can do them in parallel, okay? Let me just show you how that might work. Notice that we take our address, we do an associative lookup on the virtual page number, in parallel with grabbing the index and looking it up in the cache. We get our data back from the cache, including the tag; we do a comparison; and by the time we have the tag, we've also finished our TLB lookup, so those can happen in parallel. Now, don't worry too much about this — I'm just giving it as an interesting example for people who might find it curious. But there are interesting questions: what if you want a bigger cache? The problem is that a bigger direct-mapped cache will push index bits up into the virtual page number. The trick there is that with a bigger cache you could have, say, two 4K banks making an 8K cache: if you increase the associativity, you can still do the lookup in parallel. Or, even if the index does spill into the virtual page ID, you can do interesting things where you start part of the cache lookup first and then wait for the translated bits to come back. All right, I'm going to leave it there — that was for the computer architects in the audience. So, just some quick examples. In 2003, here are the Pentium M TLBs. There are actually four different TLBs. The instruction TLB for 4K pages had 128 entries, four-way set associative; and for large pages, which you might use for the kernel,
it had two entries, fully associative — that would be mapping a small number of very large pages, like you might have in the kernel. And there's the same split for data, with an LRU replacement policy. You might ask: why different TLBs for instructions and data, and for different page sizes? Well, the access patterns for instructions and data are a little different, they're looked up in different parts of the pipeline, and the different page sizes have different access patterns. Here's another one, the Intel Nehalem. The data TLB had 64 entries for 4K pages and 32 entries for the big pages; the instruction TLB had 128 entries for 4K pages and 14 fully associative entries for the big pages; and then there was a second-level TLB that was unified between the two. Finally, here's an example from a very recent processor. If you look at this one, you can see there's a TLB up here, a TLB in the middle, and then a unified TLB down below. So here's the data TLB down here, the instruction TLB, and the unified second-level TLB — these really are laid out this way in a modern processor. And if you look at the actual numbers, what's interesting is that it's, for instance, 128 entries, eight-way set associative; 64 entries, four-way set associative; and then the second-level TLB has a lot of entries and is highly associative. So even in these modern processors, the upper-level structures that have to keep up with the cycle time are small, all right? We just have to do that to make things fast. So, what happens on a context switch? That's a question. The address space just changed, so the TLB entries are no longer valid. These entries, whatever they were in the instruction and data TLBs, are mapping the address space of a given process; if we change processes, we suddenly have to do something, because the TLB entries are incorrect. So what are our options? One is to invalidate the TLB: you basically reach out and mark every entry in the TLB as invalid. That might be expensive, but in a lot of processors that's your only option, because if you don't do it, you're going to get wrong translations. A more modern option, which started showing up with more frequency around seven years ago, is that the TLB entries are also tagged with a process ID. That's part of the context: when you context switch, you change the current process ID, so when you change from process A to process B, the process IDs are different, the TLB entries for process A are simply ignored, and they can even stay in there if you don't kick them out. That's nice, because you don't have to flush your TLB. However, if the translation tables themselves change, the hardware isn't going to help you much, and you really have to invalidate the affected TLB entries — for instance, if a page that the TLB says is in memory is no longer in memory, you've got a problem and you have to change things. That's the TLB consistency issue I mentioned earlier: you've got to go out and flush those entries from the TLB. Okay, so let's put it all together. Here's our address, in the 10-10-12 pattern, and here's our memory. If you look, the page table pointer — CR3, as I mentioned, on Intel — points at the first-level page table. We grab our first index, which gives us a PTE that points to a second-level page table, and we use our second-level index, which gives us the final page table entry.
So we started up here with the virtual page number — everything in red up here — and what comes out is a physical page number. How many bits? The virtual page number is 20 bits, and the physical page number is also 20 bits when we're done. We merge in the offset, and that points at a page in physical memory. And since the physical page number is 20 bits, there are 2 to the 20th — a little more than a million — possible pages in physical memory that it could be mapping to. And if you think about it, the offset then says which byte or bytes in that page I'm interested in. Okay, now of course this walk is potentially slow, so now let's add the TLB. Here we have the virtual page number: what happens with the TLB is we take those 20 bits, fully associatively look them up in the TLB, and if they're there, that gives us the physical page, which we drop in here, and we've got a fast path. Okay, and then finally, just because we can, let's throw the actual data cache into this picture. This physical address — we think of it as having a byte select, an index, and a tag. The index looks up a line in the cache, and assuming the tag matches, the byte select picks out the byte you want. All right, so there you go: those are all the caches we've been talking about so far in this lecture. Any questions on that? Alrighty, everybody good? Now, if you remember, we had two critical issues going on in address translation. First of all, how do you translate the address quickly enough? I think we've mostly put that to bed. The next one was: what do you do if the translation fails? That's a page fault, and we need to talk about what happens there. This is a synchronous exception. If you remember from the very beginning of class, I talked about traps and interrupts — sorry about that, I keep pulling that guy down there — which are ways in which we go from the user to the kernel, and traps and interrupts are actually separate things. A synchronous exception, or trap, is something that occurs because an instruction has hit a faulting condition. This might be a divide by zero, or an illegal instruction, or a bus error from a bad address. Those are synchronous because, if you think about it, if you divide by zero because of a divide instruction, and you don't change anything and try it again, it's always going to happen at that same spot. That's why it's a synchronous exception. A segmentation fault, where an address is totally out of range, or a page fault, in support of the illusion of infinite-size memory — these also cause synchronous exceptions, because they happen in response to a load or store. These synchronous exceptions can't be disabled the way interrupts can, okay? Because if you could disable them and then you hit one — then what? The processor is basically unable to make forward progress, and the best you could do, I guess, is crash the whole processor. So instead, synchronous exceptions, or traps, are never disabled, and you have to make sure the operating system always knows how to handle the current set of synchronous exceptions that might be possible. Interrupts, on the other hand, are asynchronous: they occur between instructions, and could occur between any pair of instructions, if the right interrupts are enabled.
Examples of interrupts we've talked a lot about already this term are things like timers and disk-ready and network, et cetera. And interrupts can be disabled, okay? We've talked a lot about disabling them; in fact, certain interrupts can still be enabled while others are disabled, and there are many options there. So those are the two really key differences: synchronous exceptions can't be disabled and happen synchronously with instructions; asynchronous exceptions, interrupts, can be disabled and happen between instructions. And on a system call, an exception, or an interrupt, any of these cases that enter the kernel, we of course enter kernel mode with interrupts disabled, because even when you're entering with a synchronous exception you're in a fragile state until you've saved enough registers. So all of these disable interrupts, save the PC, and then jump to an appropriate handler in the kernel. And the handler does any required state preservation before interrupts are re-enabled.

The reason I'm bringing this slide back up again is because we're now talking about synchronous exceptions for page faults. And to that end, we need a concept that's important if you were to take 152 or 252, either of those, but it's also very important for operating systems, which is the notion of a precise exception. And this applies whether we're talking about synchronous or asynchronous exceptions. The idea of a precise exception is that when you enter the kernel, the state of the machine is preserved exactly as if the program executed right up to some one offending instruction: all the previous instructions have completed, and the offending instruction and everything after it look as if they haven't even started. What's good about this is that the system code, when it takes over, let's say because of a page fault, knows exactly which instruction caused the exception and where to restart once it has, say, paged the data back in. So having a precise exception is very important to really making page faults work. And it's very difficult in the presence of pipelining and out-of-order execution, but modern computer architects have figured out how to make it happen. Imprecise is basically anything that isn't precise, and there have been processors with a lot of different imprecise exceptions, and they're very messy. Usually, in order to restart after an imprecise exception, the operating system has to do a huge amount of work, and that's just something you don't want on a page fault, because you don't wanna have to figure out where things failed, all right? Performance goals may lead designers to go for imprecise exceptions, often on mathematical operations that couldn't be restarted anyway, but system software developers don't like that, because they have to figure out what to do about something like that. And I'll point out, just in case you're worried about this, that modern techniques for out-of-order execution and branch prediction pretty much fix this issue, so any time you get an exception on a modern processor that does out-of-order execution and branch prediction, you just get a precise exception. This is a big win for operating system folks like us, because we don't have to worry about the processor pipeline and what might be going on in there at the time of the exception in order to restart a page fault, okay? And this architectural support is hard.
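As a user-level illustration of what "precise and restartable" buys you, here's a small POSIX program; this is my own sketch, not something from the lecture. It relies on typical Linux/macOS behavior, and calling mprotect inside a signal handler is not formally async-signal-safe, though it's a common teaching trick. The store faults, the handler fixes the page protection and returns, and because the exception point is precise, the very same store instruction is simply re-executed and succeeds.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *page;
static size_t page_size;

/* The fix-it-up handler: make the page writable and return.  Because the
 * fault is precise, the kernel re-executes the exact store that faulted,
 * and this time it succeeds. */
static void on_segv(int sig)
{
    (void)sig;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ,            /* read-only at first */
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = '!';   /* faults; handler fixes the protection; store is retried */
    printf("store completed after the fault: %c\n", page[0]);
    return 0;
}
```

This is the same fix-it-up-and-retry pattern a kernel page fault handler uses, just approximated from user space with signals.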
So architects don't always get it right, and it's kind of funny, there are many examples of this. You're welcome to talk to me at other times about things like the original Motorola 68000, which was in a lot of printers and such: it had paging, but it didn't save the fault state properly, and you had to reconstruct a bunch of stuff in the operating system, very painful. The original Sun workstations with the SPARC processors had two fault addresses representing the point of exception, and you had to restart the pipeline in a particular way to make that work properly, so that was always tricky as well. So: the page fault is a synchronous exception, and it's a precise synchronous exception, and this is good. What does that mean? It means the virtual-to-physical translation failed: the page table entry is marked as invalid for some reason, or maybe it's a privilege-level violation, or maybe it's an access violation. These are all reasons we could end up with a failed translation, and it causes a fault. It's not an interrupt, because it's synchronous with instruction execution; I hope I've convinced you all of that. It might occur on an instruction fetch or a data access, and the faulting instruction is typically terminated in a way that's restartable; that's because the exception is precise, and that's good. So what does the page fault do? Well, it enters a page fault handler, which engages the operating system in fixing things up so the instruction can be retried. It might, for instance, allocate an additional stack page if we've pushed the stack into an unmapped region, or maybe it makes a page writable that's being shared copy-on-write after a fork, or it might bring the page in from secondary storage, which is demand paging: pull it in off the disk. Protection violations that can't be resolved, what do you do? Well, you terminate the process, and oftentimes this is where you get a core dump, which is terminology from ancient history, when memory was made of cores: little lifesaver-shaped magnetic rings that each stored a one or a zero. And so: segmentation fault, core dumped. Exactly.

Now, there's a fundamental inversion of the hardware/software boundary kind of going on here. What do I mean by that? I mean that a hardware instruction faults, and the only way for that hardware instruction to make forward progress is for software to take over, do a bunch of stuff, and restart that instruction. So this is kind of an inversion: usually you think hardware gets to do its thing and the software runs on top of it, but here it's a little different. Okay, so what's gonna happen? Well, here's the good case, right? The instruction has an address that goes to the MMU, which translates it, gets a page number through the page table, which gives it a frame number, and then we take the offset and that lets us do the access. That instruction runs at full speed and everything's great. But let's try a different example. The instruction goes to the MMU, which goes to the page table, and there's no entry for this thing, at which point we get a page fault. The instruction is the precise exception point, the page fault enters the kernel exception handler, and now we're in the page fault handler. Assuming for a moment that the data we want really is on disk somewhere, what happens is the page fault handler loads the page from disk, updates the page table entry to point at it, and then puts the task back on the scheduler, on the ready queue, and sometime later the instruction is retried. It tries again, and it succeeds.
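Here's a toy, self-contained C simulation of that fault-and-retry flow. The names (NUM_FRAMES, disk_pages, page_fault_handler, and so on) are invented for this sketch, the "disk" is just an array, and there's no eviction bookkeeping or protection handling; it's only meant to show the shape of translate, fault, fix up the page table, retry.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    4096
#define NUM_VPAGES   8        /* virtual pages in our toy address space */
#define NUM_FRAMES   4        /* physical frames: "the cache on disk"   */

static char disk_pages[NUM_VPAGES][PAGE_SIZE];   /* backing store        */
static char frames[NUM_FRAMES][PAGE_SIZE];       /* "physical memory"    */
static int  page_table[NUM_VPAGES];              /* vpn -> frame, -1 = not present */
static int  next_free_frame = 0;

/* The "page fault handler": bring the page in from disk, update the PTE.
 * A real handler would also pick a victim to evict and write it back. */
static void page_fault_handler(int vpn)
{
    int frame = next_free_frame++ % NUM_FRAMES;  /* naive placement, no eviction bookkeeping */
    memcpy(frames[frame], disk_pages[vpn], PAGE_SIZE);
    page_table[vpn] = frame;
    printf("page fault: loaded vpn %d into frame %d\n", vpn, frame);
}

/* A simulated load: translate, fault if needed, then retry the access. */
static char load_byte(uint32_t vaddr)
{
    int vpn    = (int)(vaddr / PAGE_SIZE);
    int offset = (int)(vaddr % PAGE_SIZE);

    if (page_table[vpn] < 0)          /* translation failed: synchronous fault */
        page_fault_handler(vpn);      /* OS fixes things up...                 */

    return frames[page_table[vpn]][offset];   /* ...and the access is retried  */
}

int main(void)
{
    memset(page_table, -1, sizeof page_table);   /* nothing resident yet */
    for (int v = 0; v < NUM_VPAGES; v++)
        memset(disk_pages[v], 'A' + v, PAGE_SIZE);

    printf("byte = %c\n", load_byte(5 * PAGE_SIZE + 123));  /* miss, then hit */
    printf("byte = %c\n", load_byte(5 * PAGE_SIZE + 124));  /* hit            */
    return 0;
}
```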
Okay, so this idea that the memory over here on the far right acts as a cache on a much bigger space on the disk is exactly what we mean by paging, and that's what we're gonna be talking about next. We talked about this hardware/software inversion, where in order for the instruction to complete, it has to have intervention from the operating system, which handles the page fault and then restarts the instruction. All right, so demand paging, which I've just kind of shown you, is a type of caching. What do I mean by a type of caching? Well, we found that something was missing, we pulled the data off the disk, and we put it into memory. So in this instance, the memory is acting like a level of cache on top of the disk. So what is this cache in terms of its properties? What's the block size? Well, the block size, rather than being 32 bytes like we were talking about for hardware caches, is 4K, because it's a page. What's the organization: is it direct-mapped, set associative, fully associative? Well, if we do it right, any page can be placed anywhere in physical DRAM, and so it's fully associative.

Now, there's a question: can you have something in the cache that's not in DRAM? We need to be careful here, because the word cache is a very overloaded term and there are many caches, okay? In the many-caches view of the world, this DRAM is a cache on top of the file system, even though there's also an SRAM cache on top of the DRAM. So there are multiple levels of cache, and that's exactly the multi-level caching I showed you earlier, where I said we're trying to get the speed of the things at the far left but the size of the things at the far right. This is just part of that caching, where now, when I say caching, I'm talking about the DRAM as a cache on the disk. I hope that helps with that question. So this really is a fully associative cache: as we just saw, when I pull a page off the disk and put it in DRAM, I can put it anywhere. If DRAM is a cache on the disk, it's potentially fully associative, and the reason is that the page table lets us translate any virtual page number into any physical page frame number. Because we can do any translation, this becomes a fully associative cache.

All right, how do we locate a page? Well, we first look it up in the TLB, and then we do a page table traversal if we need to, and that tells us what virtual address maps to what physical address. What's the page replacement policy: is it LRU, random, whatever? Well, this is gonna require a lot more explanation. We'll get into page replacement policies, including the clock algorithm, next Tuesday, but we really need to be as close to LRU as we can, because the cost of going to disk is really expensive. If we were physically in the same room, I'd ask how expensive, and you'd all yell "a million instructions worth of expensive," right? So we're absolutely gonna try to get the miss rate on this particular cache as low as possible. What happens on a miss? We have to go fill from disk. What happens on a write? Clearly it's not write-through, it's write-back: when we write to this cache, we write into the DRAM, and only if we replace something do we have to write it back to the disk.
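To put rough numbers behind "a million instructions worth of expensive," here's a quick back-of-the-envelope calculation in the same average-access-time style as before. The 100 ns DRAM and 10 ms disk figures are generic assumptions for illustration, not numbers quoted in the lecture.

```c
#include <stdio.h>

/* Back-of-the-envelope: average access time when DRAM is a cache on disk.
 * Assumed numbers: ~100 ns per DRAM access, ~10 ms to service a page
 * fault from disk.  Even a tiny fault rate dominates the average. */
int main(void)
{
    const double t_mem  = 100e-9;   /* seconds per DRAM access      */
    const double t_disk = 10e-3;    /* seconds to bring a page in   */
    const double rates[] = { 1e-3, 1e-4, 1e-5, 1e-6 };

    for (int i = 0; i < 4; i++) {
        double p = rates[i];
        double amat = (1.0 - p) * t_mem + p * (t_disk + t_mem);
        printf("page fault rate %.0e -> avg access %8.3f us (%6.0fx DRAM)\n",
               p, amat * 1e6, amat / t_mem);
    }
    return 0;
}
```

Even at a one-in-a-thousand fault rate, the average access is roughly a hundred times slower than DRAM, which is why the replacement policy has to keep misses extraordinarily rare.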
Okay, so here is that picture I showed you again, and this goes to that question about the cache. Notice that there are a bunch of caches on chip, second-level caches off chip, then there's DRAM, and then there's disk, and maybe even tape or SSD, whatever fits in between. What we're trying to do is come up with something that has the speed of the things at the far left with the size of the things at the far right, and that's how we're gonna use caching. In this particular type of caching, which we call paging, or demand paging, the main memory is a cache on top of the disk. So we talk about caching on chip, where the hardware manages the cache, versus paging, where the software entirely manages this cache, deciding what things are on disk and what aren't. Okay, so we'll pick that up next time.

So in conclusion: we talked again about the principle of locality throughout today, temporal and spatial locality. Remember, temporal locality says that if you use something, you're likely to use it again soon, and spatial locality says that if you use something, you're likely to use something close to it. The way you get good spatial locality is to make larger cache blocks, and in fact, against the disk, we have a 4K cache block. We talked about the three-plus-one major categories of cache misses: compulsory misses, conflict misses, capacity misses, and coherence misses. And we talked about cache organizations: direct-mapped, set associative, fully associative. I did notice there's a comment in the group chat asking whether you can have something in a cache that's not in DRAM. There is such a thing as non-inclusive caching, where things in a cache aren't necessarily in the lower levels, but that's usually an SRAM-only concern. And finally, we talked a lot about the TLB: a small number of PTEs, with optional process IDs, typically fully associative, or as close to fully associative as you can afford from a timing standpoint. On a miss you get a page table traversal and potentially a page fault, and on a change to the page table you might have to invalidate entries. We talked about precise exceptions, where there's a precise exception point at which all previous instructions have completed and neither the faulting instruction nor any following ones have started. And we can manage caches in hardware or in software.

All right, let's see, I think that's it for today. Talk to the other folks in the class. Next week we're starting; like I said, I have a little bit of a bet with another faculty member or two, who shall remain nameless, that we can get our class attendance way up to the 50% rate. So let's see if we can do that. If each of you talked to two or three of your friends, we could be way over the top. So thank you for coming, I hope you all have a great night, and we'll see you on Tuesday. Ciao.