This is lecture 10 of Computer Science 162. So our topic is caches and translation look-aside buffers today. But before we get into that, I want to do a recap of paging and segmentation-based translation. Let me turn this down a little bit. And then we'll go through multilevel translation and look at caching, what it is, sources of misses, and how we can organize our caches three different ways. And then I'm going to talk about translation look-aside buffers as a way of speeding up the translation process using caching. And then we're going to look at how we can combine memory caches with translation look-aside buffers into a virtual memory hierarchy.

OK, so let's start with segments. So here we have our virtual memory view. So this is what a process sees. And here we have a segment map that maps us from this virtual view to physical memory. So first, the stack here is segment 11. And that's mapped to a base of 1011 0000 with a limit of 1 0000. And so that maps us to this physical location. Similarly, our heap is segment 10, and that maps us to 0111 000, right here in physical memory. Same thing for our data. That's mapped as segment 01, starting at location 0101 0000. So that's this region here. And then finally, our code is segment 00. So that maps us to location 0001 0000, which is down here. So now we can ask questions like, for example, what happens if our stack grows? So our stack is growing down. So our stack grows down here. Well, the problem is there's no room for our stack to grow here. So what's going to happen is we'll end up with some kind of segmentation fault that gets passed to the kernel, and then the kernel will have to do something, like, for example, move other segments around so that it can make room for the stack to grow. Segments have to be contiguous in physical memory.

So as an alternative that gives us a much simpler allocation mechanism, we looked at paging. So with paging, we take our virtual memory and we divide it up into fixed size units. So unlike segments, which could be variable size, pages are always a fixed size. We pick it for the particular architecture. So it might be 4 kilobytes, it might be 16 kilobytes. It depends on the virtual memory architecture, but it's fixed. So all of our allocation is done as a page. You can't allocate less than a page. If you want to allocate more than a page, then you just simply allocate multiple physical pages. So our allocation now is very, very simple. And the way it works is we have a page table. And the page table maps our virtual page number. So in that case, it's this red portion of the virtual address. We look it up. So here for our stack, we'd look up 11110. And 11110 maps to 11100. So that's right here. So now we can ask the question again, what happens when our stack grows? And here it's very easy. Even though we don't necessarily have a lot of room here, we only have one free page, we can allocate our pages anywhere. And then just simply make the page table point to the appropriate location. So no longer do we actually have to have our physical regions be contiguous. The only contiguity we need is within a single page. But those pages could be scattered throughout memory. And in fact, for a typical machine, after you've run a bunch of programs on it, you'll find things are allocated all over the place. But it doesn't matter.
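Just to make the basic page table mechanics concrete before we go on, here's a minimal sketch of a single-level lookup in C. The field widths and the pte_t layout here are assumptions for illustration, not any particular architecture's format:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative PTE layout: a 20-bit physical page number plus a couple of
 * flag bits. Real architectures define their own formats. */
typedef struct {
    uint32_t ppn   : 20;   /* physical page number */
    uint32_t valid : 1;
    uint32_t write : 1;
} pte_t;

#define PAGE_SHIFT 12                            /* 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Returns true and fills *paddr on success; false means we'd fault to the kernel. */
bool translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number = table index */
    uint32_t offset = vaddr & PAGE_MASK;         /* offset passes through unchanged   */

    pte_t pte = page_table[vpn];
    if (!pte.valid)
        return false;                            /* page fault */

    *paddr = ((uint32_t)pte.ppn << PAGE_SHIFT) | offset;   /* concatenate, no addition */
    return true;
}
```

The point to notice is that the offset is simply concatenated onto the physical page number; unlike with segments, there is no addition.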
From the point of view of the program, it has a nice contiguous virtual address space, which we allocate on demand. So there's lots of blanks here. And when we touch one of those, we'll have to allocate real pages to match. So the challenge here is the size of our table is going to be equal to the number of pages that we could have in virtual memory, not just the number in use. So especially when we get to very large architectures, like a 64-bit virtual address space, if we had to have a page table for that that was all in memory, it would be larger than the size of memory that we could afford to buy. Any questions about segments or paging? Yes. So the question is, what do we do when we have 64-bit address spaces? We use multi-level translation. That's the solution. So we're only going to store entries for those pages that are in use. We can't do that with this approach, right? With this approach, we need a full contiguous table, and that table has to actually be stored in contiguous memory. Any other questions? Okay.

So let's look at multi-level translation. So the idea here is rather than just have a single allocation scheme, like segments or pages, why not have multiple? We can do one, we can do multiple. So we can build a tree of tables, basically. So the lowest layer could be a page table. The advantage of doing this at the lowest layer is pages are a nice, easy unit of allocation. And then higher levels could be segmented. That's one scheme, to have segments at the top level, pages at the next level. But in reality, we could have an arbitrary number of levels. So let's say we have segments at the top level and then pages at the next level. So here, our virtual address is composed of a virtual segment number, a virtual page number, and an offset within a page. Okay, so remember with paging, since our lowest level is paging, our offset is an index into the page. There's no addition as we have with segments. It's something students always get confused by. With segments, we have to add the offset to the base. With paging, the offset is relative to the start of the page. So there's no addition that has to happen.

Okay, so what are we gonna do with the rest? So first we need to find the appropriate page table. We're gonna find that by looking up the virtual segment number first. So we take our segment map pointer, which is a register, which points to our segment map in memory. And we're going to index into that segment map with our virtual segment number. Okay, so we do that, and in this case, let's say it's the third entry. So that now returns the location of a page table. And so before, the segment was giving us the actual data page; now it's giving us a page table. The page table has a size limit of two. So now we're gonna take our virtual page number and we're gonna index into that table. So we look up the virtual page number in that, and that now gives us a physical page number, since our lowest level scheme is paging. So now we have a physical page number that we can use. All right, are we done? Not quite. Yes. Okay, very good question. What is the purpose of limit two? The limit here tells us how large this page table is. So the reason why I said we're not quite done yet is we need to check all the bits and the sizes and everything to make sure we actually had a valid translation. So first of all, we have to check to make sure that this virtual page number is not larger than what would be allowed by the limit. That is, that we haven't run off the end of the table.
We also have to check to make sure that this segment is actually a valid segment. Some of these are not, like this one's not valid. Then we have to check the page table entry here. So we actually have to check to make sure it's valid, and we have to check that we're actually accessing it the right way. So for example, this one's marked read write, but if we tried to write to page zero, we'd have to generate an error because that page is marked read only, okay? Everybody follow? Okay, so what do we have to save on a context switch and what do we have to restore? Well, if we have a scheme like this where our top level is a segment, then we have to save this segment map pointer register, which tells us where to find the segment map in memory, okay? If our top level was a page table, so if we had a page table on top of a page table, and I'll show you a picture of that in just a few minutes, then we'd have to save the pointer to the top level page table, right? That would be a register also. So in both cases, we need to just save a register in order to do our context switch and then restore a register for the new address space, yes.

Ah, so what's the benefit of multi-level translation? So again, if we go back to this example, the challenge here is this table is a contiguous table, and it has to have one entry for every page that we could have in the virtual address space. So imagine we have a 64-bit address space, so two to the 64 bytes that we could potentially address, and that's broken up into pages that are, I don't know, say 12 bits of offset. So we'd need two to the 64 minus 12 page table entries. That's a lot, right? That's two to the 52, which is a really, really big number. We could not do that; it would consume all of the physical memory that we could potentially have and then some. And most of that 64-bit address space would be empty. So what multi-level page tables allow us to do is they allow us to just have page tables, or segment entries, representing the portion of the address space that's in use, right? So large portions here are not in use, right? And the size of this table only has to map the portion of that segment that's actually in use, right? So if we have segments that are not in use, they don't need any entries. Or if we're using a page table on top of page tables and we have regions that are not in use, they don't have to be mapped, they don't take up any entries. Only when we actually use them do we actually need entries, yes.

So the amount of memory, again, is going to be proportional to the number of pages that we have in the virtual address space, right? So you can just calculate the size of a page, the number of pages that you have, and then you need some number of bytes for each page table entry, like four bytes. It's gonna take a pretty big chunk of memory, okay? Even for a 32-bit address space, yeah. So base three, that's correct. So if you tried to reference the segment for base three, you would generate a segmentation fault because it's marked as invalid, okay? Similarly, if I tried to reference page number six in base two, I would also generate a fault because I'm running off the end of the page table, and that's an invalid entry. In fact, I can't even access that because that would be reading a memory location that's not mapped by this segment, right? This segment only maps a memory region this size. Another question, yeah.
Okay, yeah, very good, yeah, very good question. So where do these tables live? So the top-level table always lives in memory, and then the lower levels can be paged out to disk. So this means we only need to potentially keep a small amount of our tables in memory, the active portions. When we get into paging, we'll see this in more detail. Yeah, yeah, sure, you first. Yeah, the top-level, whether it's a top-level page table or a top-level segment table, that always has to be in memory, but it can contain information that tells you, oh, this is located out on disk, so go fetch it from disk. We'll see a lot more about this in Wednesday's lecture when Professor Candy talks about paging. And then you had a follow-up. Okay, so the question is, when we wanna add a new page, how do we know where we can put it? So we have to maintain a bitmap for all of our physical memory, and it has a bit that tells us whether the page is in use or the page is not in use. If the page is not in use, we can allocate it. So it's much, much simpler than with a segment, a pure segment-based approach. With a pure segment-based approach, we have to maintain these free lists of pages, of memory rather, of varying sizes, and then when we need a memory region of a given size, we have to try and see, do we have adjacent regions that we can coalesce that are free, or do we have to move things around? So segments are just very, very expensive to deal with as a lowest layer. So no systems today use segment at the lowest layer. The bitmap will be, so the question is the size of the bitmap, it's gonna be proportional to the number of physical pages that you have. And so that's a reasonable size. The largest you might have in a modern machine is a terabyte of RAM organized in, say, 16K pages. So you'd need one bit for each one of those pages. Any other questions? Okay. All right, so sharing. We can now do sharing at various granularities. So for example, we could share an entire segment very easily. So in process A, we have our virtual address here and it's gonna point to its segment table here which maps to a base two in this case to a segment that we wanna share. In process B, we simply have our entry point to the same shared segment. It is not a requirement that they have the same virtual segment number. It could be any of these that we actually map. It could be base six that gets mapped to it. Doesn't matter, right? What does have to be equivalent is limit. The limit value does have to be equivalent, right? But this is a very easy way where now these processes could communicate between each other using this shared memory and they could do that at very high bandwidth, yes. Yeah, that's a good question. Could you have different size limits so that you could have an exclusive space at the end of it? No, you wouldn't do that. You'd have a separate segment that was exclusive, say, to process A. Yeah, all right, question is, do both of them have to be valid? Yes, they both have to be valid. Again, it gets more complicated when you're doing paging and if you paged out the page table, then you'd have to mark them both as invalid and on disk. So you have to do reference counting. It gets a lot more complicated when you do sharing of memory regions, yes. Ah, so the question, that's a good question. How do we prevent sharing? So sharing has to be mutual so they'd have to both register with the kernel that we want to share this region between. So one example might be a parent might create a shared region and then fork a child. 
So the child also has a handle to the shared region. There's a follow-up question. So again, if they're both not valid because there really is no valid segment, then you can't share. If they're both not valid because it's actually sitting out on disk, then that's completely reasonable and the page fault handler will just simply pull the page in when either one of them references the page table.

Okay, so another very common scheme, in fact what most systems use, is a two-level page table hierarchy. So here we have our page table pointer and it points to a top-level page table. We're gonna say our page table entries are four bytes in size, and our virtual address now is split up into two sets of virtual page numbers. The first set refers to the top-level table. The second set refers to the second-level page table, and then we have our offset as before. So we've taken our 32-bit address and we've split it up into 12 bits of offset, 10 bits of virtual page number two, and 10 bits of virtual page number one. All right, so now our tables are fixed size. They have 1024 entries, 1024 entries, and you could have probably figured that out on your own looking at this. How? What's the size of a page? How many bits for indexing into a page? 12, okay? What's two to the 12? Sounds like a good exam question, two to the 12. 4096, okay? What's 4096 divided by a four byte page table entry? 1024, okay? So that's the size of these tables, 1024 entries. Don't be surprised if you see a question like this on the midterm, all right? Also, we might say, you know, how do you construct your virtual address division between offset, virtual page number two, and virtual page number one, right? How many bits do you allocate to each one, right? Well, in this case, if there are 1024 entries, then you need 10 bits, since two to the 10 is 1024, in order to reference it. So that means each one of these has to be 10 bits, okay?

All right, so we use the page table that we get from the top level page table, from indexing it with the virtual page number one, to find our second level page table, which we index into with our virtual page number two, and that gives us our actual page. We use the offset to index into that page, okay? So that's two level pages. Now, we need valid bits on all of these page table entries because we don't need every second level table, all right? So this is really good for sparse address spaces, and even when they exist, these second level page tables can be out on the disk if they're not being used, all right? So we can page them out, all right? So two different schemes that we can use, segments and pages or pages and pages, for dealing with very large sparse address spaces, and this is the most common one in use today, yes? Okay, what's the difference between segments and pages versus pages and pages? So a segment is a variable size region whereas a page is a fixed size region. So each one of these entries here points to a fixed size page table, a 1024 entry page table, one page, all right? Whereas with segments, we could point to a variable size page table because we could specify the limit to tell us how large that page table was.

Okay, so here's a summary of how paging with two-level page tables works. So let's say I want to resolve the following address, right? So we've broken up our address into an italicized red virtual page number one, a non-italicized green virtual page number two, and, in blue, three bits of offset.
Okay, so if we want to resolve this address, we're first going to look up the first three bits, one zero zero, in our top level page table. So that's going to be right there, that's going to be that one, okay? Then we read that entry, and that gives us a pointer to our second level table, and we're going to index into that with these two bits, the one and zero in green. So that's going to be this entry here, so we're going to read that entry. That's going to give us the page table entry that gives us the physical page number of the actual data, okay? And so that is right there, all right? So the best case is our total size, in terms of page table structures, is going to be proportional to the number of pages used by the program in virtual memory. So again, great for sparse address spaces. The downside is we have to do two additional memory accesses for each reference, right? In order to read this location, we had to read this entry in this page table and we had to read this entry in this page table. So that's kind of expensive. Question? So in this case, the question is what is each entry here? Typically it would be a page table entry, so it's going to be some number of bytes, typically four bytes or eight bytes, in physical memory. It's going to be very architecture specific. Different architectures are going to store different bits. We'll see when we get into paging that we have, was it recently accessed, like a used bit? We have dirty bits, we have all sorts of bits that we'll maintain. You could also have access control bits, like it's read-only or you can write it or it's execute-only and so on. Question? Ah, so the question is, if I want to access multiple addresses on this page, do I have to perform the translation each time? So far as we've seen, yes. So our machine is going to run substantially slower than we would like. We'll figure out, if I ever get to it, how we can use caching to remember that operation. So we're not just going to use caching to remember the contents of memory, but we're also going to use it to remember a process like address translation. You know, we start out with this virtual page number, one zero zero one zero, and we end up with this physical page number, one zero zero zero zero. So we might want to remember that if we're accessing that page a lot. Question? So each entry here in each of these page tables points to a page, in this case, in physical memory. It need not reside in physical memory though, because we can page it out to disk.

Okay, so some of the advantages. So the advantage here is we only have to allocate as much in terms of page table structures as we're actually using; it's going to be proportional to the amount of usage that we have of our virtual memory. So sparse address spaces are trivial to do with this and they're going to be space efficient. Very easy memory allocation, because at the lowest level we're using pages. So we don't have to worry about best fit or buddy allocation or, you know, some other kind of algorithm and compaction and moving things around. We'll never have to do that. We get very easy sharing. We could either share at the segment level or we could share individual pages. But of course, if we're doing sharing, we have to do reference counting so we can figure out when it's okay to deallocate a memory region.
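Before the disadvantages, here's the walk we just traced through as a sketch. The 10/10/12 split matches the earlier example; the PTE layout and the assumption that we can address physical memory directly are just for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative PTE: a 20-bit page number plus a few of the flag bits just
 * mentioned. An entry in the top-level table names the page holding a
 * second-level table; an entry in a second-level table names a data page. */
typedef struct {
    uint32_t ppn   : 20;
    uint32_t valid : 1;
    uint32_t write : 1;
    uint32_t dirty : 1;
    uint32_t used  : 1;
} pte_t;

#define PAGE_SHIFT 12
#define ENTRIES    1024u                         /* 4096-byte page / 4-byte PTE */

/* Assumes physical memory can be addressed directly at its physical address,
 * which keeps the sketch simple. */
bool walk(const pte_t *top_level, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn1   = (vaddr >> 22) & (ENTRIES - 1);          /* top 10 bits  */
    uint32_t vpn2   = (vaddr >> PAGE_SHIFT) & (ENTRIES - 1);  /* next 10 bits */
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);       /* low 12 bits  */

    pte_t pde = top_level[vpn1];                 /* memory access #1 */
    if (!pde.valid)
        return false;                            /* no second-level table here */

    const pte_t *second = (const pte_t *)(uintptr_t)((uint32_t)pde.ppn << PAGE_SHIFT);
    pte_t pte = second[vpn2];                    /* memory access #2 */
    if (!pte.valid)
        return false;                            /* page not mapped */

    *paddr = ((uint32_t)pte.ppn << PAGE_SHIFT) | offset;      /* access #3 is the data */
    return true;
}
```

Notice the two extra reads just to find the data page; that is exactly the cost I'm about to complain about.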
But there are a lot of disadvantages. So we have one pointer per page, and that's typically, you know, for four to 16 kilobyte pages. So when we start to have machines that have a terabyte of memory, that is actually a lot of pointers that we could potentially have to maintain. And these page tables, especially on 64-bit address machines, have to be contiguous, right? In the examples I showed you, the page table is only one page, but page tables could definitely be much larger than a single page. And we have to do a lot of memory lookups. So we're doing three memory lookups for each reference, including the reference itself. So that's very expensive, right? We have to look at the first level table, look at the second level table, and then we could actually read the thing we wanted to read. So that's not good.

Okay, so in summary, segments are good because we can do very fast context switching. The disadvantage is we have external fragmentation. Paging with a single level doesn't have external fragmentation and has very fast and easy allocation. But the downside is the table size is gonna be proportional to the virtual memory size. So this doesn't work for large address spaces like 64-bit address spaces. Then we looked at two different multi-level schemes. And here now the table size is proportional to the amount of virtual memory that's actually in use. That's good. And we get fast and easy allocation since the lower level is pages. Downside, lots of memory references to do each reference to memory, because we have to read either the segment table or the top level page table, then we have to read the second level page table, and then we actually get to read the thing we're looking to read. There's a very easy solution, and the rest of this lecture is all gonna be about that solution: caching.

So caching is just a way of having a copy that we can access faster than the original. So the goal here is to make the frequent case very fast and the infrequent case's cost less dominant. Because the infrequent case is gonna be when we have to go and fetch the original, and that's gonna be really expensive. We don't want that to dominate our access cost. So we can cache at many different levels. We can cache memory locations. We can cache address translations. This is what we're gonna talk about today. We can cache pages. We can cache file blocks. Pages we'll see on Wednesday. File blocks and file names and network routes and so on. Everything in a computer relies on caching for good performance. From the system level, from the chip level, all the way up to the application level. Now caching will only work if the frequent case is frequent enough. If it's infrequent, caching is not gonna help. And if the infrequent case is not too expensive. If the infrequent case is ridiculously expensive, caching isn't gonna help, because in those rare times that you go and have to do the infrequent case, that's just gonna dominate your total time.

Now an important measure for a memory hierarchy, or for any system involving caching, is what's the average access time? The average access time has two components: average access time = hit rate × hit time + miss rate × miss time. The first component is the hit rate times the hit time. So that's when you find it in the cache, what's the time to access it in the cache? Plus the miss rate, so the fraction of the time when you don't find it in the cache, times the cost when you have to fetch the original. So example. Let's say we have data in memory and we don't have a cache. Our access time to memory is 100 nanoseconds. So our average access time, 100 nanoseconds. So let's add a cache.
So we're gonna add a second level, a static RAM cache, much faster than dynamic RAM. We're gonna say this has a time to access of 10 nanoseconds. Our time to go to main memory is 100 nanoseconds. So our average access time, again, is our hit rate times our hit time, plus our miss rate times our miss time. Key thing, the hit rate and the miss rate have to sum up to one, because either we find it in the cache or we don't find it in the cache and we miss. So let's look at different hit rates. So what if our hit rate is 90%? What's gonna be our average access time? Any clues? So what is 0.9 times our hit time of 10 nanoseconds? Nine. And our miss rate is one minus our hit rate, so 0.1 times 100 is 10. Okay, so we end up with nine plus 10, 19. So we're doing much better than 100 nanoseconds, but we're not doing quite as well as the cache; we're at almost double the access time of the cache. Yes. That's a good question. So the question is, do we also need to include the time to look in the cache? Absolutely. So I'm assuming in this case that the time to look in the cache is included in the time to access memory. And there are many ways you could do that, like by doing it in parallel. Was there a question in the back? Okay. All right, what if we increase the hit rate to 99%? What's our average access time going to be now? So it would be 0.99 times 10, so that's going to be 9.9, plus 0.01 times 100 nanoseconds, so that's one. So 9.9 plus one, 10.9. So now we're getting much, much closer to performing like our cache instead of performing like our slow memory.

Now we can take this and look at it in the context of a computer. And in a computer, it's not just one level, rather it's a whole hierarchy. We have registers, which we can access in fractions of a nanosecond, L1 caches, which we can access in nanoseconds. And as we go up the hierarchy, sizes go up pretty dramatically, right? So we go from having a few hundreds of bytes that we store in registers all the way up to terabytes that we can store on a disk, right? But accessing the disk is taking tens of thousands of nanoseconds versus fractions of a nanosecond. So our goal here with caching is we wanna give you basically as much memory as we can in the cheapest technology. So disk is really, really cheap relative to everything else, main memory is cheaper than on-chip memory, and so on. But we wanna give you the access speed of the far left. So we want you to have the illusion that you have huge amounts of space that you can access at chip speeds. So that's what caching is all about. It's sort of perpetuating this illusion.

Okay, now, can we actually do this? Does caching actually work? The answer is yes. And the reason why is one word, locality. So if we look at our address space, let's say this is our virtual address space and we look at the probability of a reference, sort of what fraction of our references are going where, we notice a peak, right? In fact, we notice a couple of peaks. And there are two aspects to these peaks. One is temporal locality. So if we've accessed something recently, we're more likely to access it again in the future. So that's locality in time. The second is spatial locality. And that is if we access something, we're likely to access something near it in the future. So that's locality in space. So typically we'll find in any system, if caching is gonna work, we either have temporal locality or spatial locality or we have both.
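Here's a toy loop, not from the slides, just to pin down the two kinds of locality:

```c
/* A running total touched every iteration is temporal locality; marching
 * through consecutive elements of each row is spatial locality, which is why
 * it pays to pull a whole cache block in at a time. */
int sum_matrix(const int a[100][100])
{
    int total = 0;                       /* reused constantly: temporal locality */
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++)
            total += a[i][j];            /* adjacent addresses: spatial locality */
    return total;
}
```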
So again, our illusion here is we want to move contiguous blocks from lower memory to upper memory so that we get this illusion that we're operating at the upper memory speeds. And if we've accessed something in the past, we're likely to access it again in the future, so we keep it in our upper memory. And we're moving by block, because if we access something, we're likely to access something near it in the future. So our unit of transfer will be these blocks. And we'll see this when we look at caches and cache lines. Yeah, no. So the question is, when we have a page table and it's mapping pages to opposite ends of the address space, does this break, for example, spatial locality? So we'll have spatial locality within the page. And when we cross a page boundary, it doesn't matter where that physical page is located in memory, right? It's not as if some pages in physical memory are closer to the processor than others; they're all equally far from the processor. So we can pull them in a block at a time. Or if we're pulling them in from disk, they're sort of all equally far away. Or we think of them that way; especially with SSDs, that's true.

Okay, so when does caching not work? Well, caching doesn't work when we don't find the item in the cache. So let's ask a question. Why might we not find an item in the cache? Any ideas? Yeah, right, so exactly. The cache is already full, so we would take what's called a capacity miss, right? The item's not in the cache because it got bumped out of the cache by something else. How about another one? Yeah. Yes, so if we've never accessed something before, it's not gonna be in the cache. So that's our first kind of miss. That's a cold start or compulsory miss. So the first time you reference something, you have to load it into the cache. So when we look across a program's entire execution, we're never gonna be able to have a 100% hit rate because of compulsory misses. Now, if we're running billions of instructions, cold start misses don't really matter, because they're gonna happen in the beginning, and then if we look over a window of time, we could actually see a 100% hit rate. So our second one is the one that was mentioned in the back, which is capacity. So we can't contain all of the blocks that are gonna be accessed by the program. The solution here is just to simply make our cache larger. There is a solution actually to compulsory misses, which is we could just preload everything. But the downside of preloading is that's gonna delay us from starting execution. And it's usually faster to just page in on demand. The next set of misses is conflict or collision misses. So this is where multiple memory locations map to the same location in the cache. And the solution here is to make the cache bigger or to increase the associativity of the cache. So we'll see this when we go through the different organizations for a cache. And then there are two other types of misses, which we're really not gonna look at as much in this class. One is coherence misses. And this is where an entry in the cache gets invalidated because another processor or another core or an IO device modifies the item in memory. So we have to invalidate it. The last one is policy misses. And this is where we pick the wrong item to replace when we have a capacity issue. And we pick the wrong item in the sense that it gets referenced again a short time later; an optimal policy would have picked something else to replace.
So there's a bunch of questions we can ask when we have a cache. So let's say we have an eight byte cache here. We have a 32 byte memory, and our block in this case is gonna be one byte. So our unit of transfer from memory into the cache is gonna be just one byte. We're gonna assume the CPU accesses the following location: 01100. So the first question we wanna ask is, is that byte cached? So here's the byte in memory. Is it stored in our cache? The second question we wanna ask is, if it isn't stored in our cache, where would we put it in the cache? And we have eight choices of places to put it. And then the third question we're gonna ask is about our replacement policy. When the cache is full, which byte do we pick to evict? And so three questions that we'll ask of any cache organization: whether a byte's cached, where we put a byte in the cache, and when the cache is full, what do we evict?

So the first example should be one that you're familiar with from 61C, which is direct mapped caches. So here, each byte in physical memory is cached to a single location. And we're gonna use the least significant bits of the address. So the last three bits are gonna tell us where to put it. So what this means is these four addresses, 00100, 01100, 10100, 11100, are all gonna get cached to the same location in the cache. They're all gonna be stored in the same location. Now, how do we know which one of those is cached? That's the role of the tag. The tag is going to be two bits that tell us whether the item stored there is the one we're looking for. So for 01100, our tag is 01; the upper, most significant, two bits become our tag. Okay. Now, how do we know which byte is cached? Well, the tag is storing the... oops, I've got one other thing first. When the cache is full, which cache byte do we evict? And that will again be the one at index 100. So that's direct mapped. Should be familiar to you, hopefully. Fully associative should also be familiar to you. So in this case, we can store a byte at any location. So the tag has to store the entire address. So here it's gonna store 01100. So how do we know whether an item is cached? We simply look through all of the tags and see if we find a matching tag. Where do we place an item that we want to cache? We can place it anywhere. If the cache is full, which cache byte do we evict? That's gonna depend on our replacement policy. And there are many different replacement policies that we could have. Any questions?

All right, so some administrative stuff. So Project 1 code is due tomorrow. I think we only have something like eight groups that are passing all of the public auto grader tests. So congratulations. If you're one of the other almost, what is it, I think 28 other groups, hopefully you are working away on your own test cases to try and figure out why you're not passing all of the auto grader test cases. Now, a very, very, very important caveat is there are eight public cases. There are a lot more private cases that we're gonna actually use for grading your assignment. So, simply passing all eight does not mean you're gonna get 100%. You wanna make sure you are doing good testing to exhaustively test your code. Question? Oh, I think there's something like, I wanna say, 20 something total test cases. So, eight out of like 25 or something like that. So, test early, test often. So, if you're thinking of using slip days, please don't.
I mean, they're yours to use, you have four, but I would strongly encourage you to save your slip days for later in the semester when you're gonna be a lot busier and the projects are gonna be a lot more complicated. So, they're yours, use them wisely. The design doc and group evaluations are due Wednesday by midnight. Very important: make sure your design doc incorporates the feedback from your TAs and the changes that you've made. Group evaluations are anonymous to your group. So, no one else in your group can see them, so please be honest. If people did extra work, say that. If they didn't pull their weight, say that. We'll say this multiple times throughout the semester, but somehow or other every semester we have one or two groups that report a perfectly even division of work up until project three, or sometimes project four, at which point there's suddenly an imbalance and they come and complain that someone in their group hasn't been doing any work all semester long. So, we'd like to see these problems early. We can deal with them proactively if you guys tell us early. So, those are anonymous, please be honest. We have a midterm coming up in two weeks. It's going to be divided between this classroom and 2060 Valley Life Sciences Building. It's closed book, double-sided handwritten notes are allowed, and no calculators, smartphones, or other technological devices. It'll cover lectures one through 13; lecture 13 is the Wednesday lecture before the Monday midterm. And the readings, the handouts, and projects one and two. So, you should be familiar with all aspects of the project. The TAs are gonna hold a review session. They'll spend part of the time reviewing the material, but they're also going to expect you to bring lots and lots of questions. There's more than a decade's worth of midterms online. So, there is no excuse not to do well on this midterm. And you can see that many of the midterm problems are not identical but are testing similar kinds of concepts. So, don't be surprised if there's a CPU scheduling problem like lecture eight. Any questions? Okay, so with that, I think we're gonna take a little bit shorter break, like a three minute break.

Okay, so someone actually asked a good question, which is that the division here between the two exam rooms is by last name. Okay, so let's dive down now and look at the actual details of what it looks like inside a direct mapped cache. So the first thing is we take our address, okay, so we have a 32-bit address here. This is a physical address. And we're gonna have the lower bits, the lower five bits, be the byte select. Then we're gonna have a cache index and we're gonna have a cache tag. So very similar to the simple example, except now we're actually going to look at how it's laid out. So we're going to use the cache index to figure out which line we store it in. So this selects a cache block. Then we use the byte select to tell us which of the 32 bytes is the byte we want. So five bits, two to the five is 32. And then we use the cache tag to compare against the tag of this entry, to see if this entry is actually what we're looking for. Remember with direct mapped, there's only one place it can go. And so if this does not equal, then we know it's not stored in the cache. We also have to check the valid bit. So we're gonna do a comparison first and then we'll check the valid bit. If they all match, we have a hit and we have the byte we want, okay? So we can return that byte. So again, data that has the same cache index is gonna potentially cause us to have conflict misses.
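In code, that address decomposition looks something like the following sketch. The 32-byte block (5-bit byte select) is from the slide; the 10-bit index, so 1024 lines, is just an assumed size to make it concrete:

```c
#include <stdint.h>

#define BLOCK_BITS 5     /* 2^5 = 32-byte cache blocks, as on the slide      */
#define INDEX_BITS 10    /* 2^10 = 1024 lines -- an assumed size, not given  */

static inline uint32_t byte_select(uint32_t paddr) {
    return paddr & ((1u << BLOCK_BITS) - 1);
}
static inline uint32_t cache_index(uint32_t paddr) {
    return (paddr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
}
static inline uint32_t cache_tag(uint32_t paddr) {
    return paddr >> (BLOCK_BITS + INDEX_BITS);   /* whatever bits are left over */
}
/* Hit test for a direct mapped cache: the line at cache_index(paddr) is valid
 * and its stored tag equals cache_tag(paddr); byte_select(paddr) then picks
 * the byte out of the 32-byte block. */
```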
They'll all map to the same line of the cache. Yes, where are the cache tags stored? This is all stored in the cache. If it's an L1, L2, or L3 cache, that's actually on the processor. L1 and L2 are in the same core; L3 is shared between multiple cores. The cache tag is, yeah, it's kind of like the name; it's the unique identifier for this address in this entry. Why is it so huge? Because the size of the cache, the number of cache lines, is gonna dictate our cache index size. And the size of our cache block is gonna dictate the size of our byte select. Everything else is gonna be tag. So if we made our block wider, then we'd have more bits for the byte select. If we added more lines to our cache, we'd make our cache index wider. And then the leftover bits are used as the unique tag to identify that entry with that physical address.

All right, now it's gonna get complicated. So we looked at fully associative caches where we could store an item anywhere. We looked at direct mapped caches where we could store it in one location. But as with everything in hardware, there's a middle ground. And the middle ground is what's called a set associative cache. It's typically called an n-way set associative cache, and the way to think about it is that, given an n, we have n direct mapped caches that all operate in parallel, okay? So we have our byte select, our cache index, and our cache tag. Now what we're gonna do is, instead of having a single place where we can store an item, we have, in this case, a two-way set associative cache. So we have two different places where we could store it. We could either store it in this set or we could store it in this set. So our set is made up of two cache lines. So now what we're gonna do is we're going to compare the tags in parallel, right? So we've now got two comparators for our two-way set associative cache. So we're gonna take our cache tag here, we're gonna send it over here for that comparator, and we have to check the valid bits. That goes into our little AND gate. And then if it matches, we're gonna use the byte select to select the appropriate byte. And then the mux will tell us, so here we hit on the left side. And so in our set zero, so the mux selects this one, or actually I guess that's set one, so it selects this one. And that's what's returned, along with the indicator to say we had a hit. We found it. Okay? This is the one that always confuses everybody. Two different places we can store it. If it was four-way set associative, four different places where we could store each index. Okay, eight-way, eight different places. Yes. That's correct. The index tells us which line we're storing the bytes in; the bytes that are stored in the cache block came from memory. We read 32 bytes at a time in from memory. Now, why do we read 32 bytes at a time? Spatial locality. Because if we've accessed one byte, we're likely to access bytes around it. So better to load 32 bytes at a time rather than loading one byte at a time.

Okay, so we could have eight-way set associative, and we could just take that to the extreme and make it completely associative. That's the fully associative version. Now in the fully associative version, we have our byte select, and we get rid of the index, because there is no single place that we look to find it; we have to look everywhere. So now our cache tag is really long, it's 27 bits; we don't have the five, or rather four, bits of index anymore.
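Here's the lookup logic as a software sketch, with the number of ways and sets as parameters you can play with; direct mapped is one way, and a single set holding every line is fully associative. The sizes here are made up, not from the slides:

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS       2     /* 1 = direct mapped; NUM_SETS = 1 = fully associative */
#define NUM_SETS   4
#define BLOCK_SIZE 32

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static line_t cache[NUM_SETS][WAYS];

bool cache_lookup(uint32_t paddr, uint8_t *out)
{
    uint32_t offset = paddr % BLOCK_SIZE;        /* byte select                        */
    uint32_t block  = paddr / BLOCK_SIZE;        /* block number in memory             */
    uint32_t set    = block % NUM_SETS;          /* which set the block must live in   */
    uint32_t tag    = block / NUM_SETS;          /* identifies the block within a set  */

    for (int way = 0; way < WAYS; way++) {       /* hardware does these in parallel    */
        const line_t *l = &cache[set][way];
        if (l->valid && l->tag == tag) {
            *out = l->data[offset];              /* byte select on a hit               */
            return true;
        }
    }
    return false;                                /* miss: fetch the block from memory  */
}
```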
So now, in the fully associative version, we're gonna compare the cache tags of all the cache entries in parallel. So here we have our block; the byte select tells us which byte, we have our valid bit, and our cache tag now has to be compared in parallel with all of the entries. So if our cache has 512 entries, we need 512 comparators for our fully associative cache. So the advantage here is we can put it anywhere, so we don't have to worry about collisions. But the downside is we need a lot of hardware. So that can actually make a fully associative cache, especially if it's big, be slow. Because you have to take this tag and you have to mux it out to all the comparators, then you have to mux back the answers from all the comparators and actually select the appropriate line. So people typically don't build very, very big fully associative caches, maybe 64, 128, 256 entries at max.

Okay, so where do we place a block in the cache? So here we have block 12 in our 32 block address space. We have an eight block cache. If we had a direct mapped cache, we would put it into one location, 12 mod eight. So we'd put it into location, what is that location? Four. With a fully associative cache, we can put it anywhere. We're gonna use the tag to determine where it's actually stored. In a set associative cache, it could go anywhere in set zero. So the key thing is when we compare different cache organizations, we always have the same total number of cache entries. So in this case, we have an equal number of entries, eight entries. In the fully associative, there are eight choices. In the two-way set associative, there are two choices. In the direct mapped, there's a single choice. Question in the back? Yes, that's correct. Yeah, this is actually a four-way set associative. Okay, so which block do we replace on a miss? For direct mapped, it's really easy, right? There's only one block it can go in, so that's the block we replace. For set associative or fully associative, we have a couple of choices. The most common ones that people use are random and LRU. Yeah, question? Oh, the one here. This is a two-way set associative. There are two places within each set that we can store it. Yeah, there are four sets, but it's two-way set associative. Yeah, that gets very confusing very quickly. Okay, so we have a couple of choices. The most common choices that people use are random or LRU. LRU, least recently used, is attractive because typically if we've recently accessed something in the past, we're likely to access it again. Random is attractive because it's fast; it's incredibly easy to implement and there's no state necessary. With LRU, we have to keep track of when we actually last accessed each item, so that can get very expensive. So typically we don't use LRU in hardware; we use LRU in software implementations.

So here's an example of some TLB miss rates. This is just an application that someone ran against a simulator, looking at how the TLB would perform with different degrees of associativity, with different sizes, and with different replacement policies. And a couple of immediate takeaways. One is we notice that the miss rate is higher for random than it is for LRU. Random is just picking an entry at random and evicting it to make room for another entry, whereas LRU is trying to be a little bit more principled and taking into account temporal locality.
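To make that trade-off concrete, here's roughly what picking a victim looks like under each policy. The per-line last_used timestamp is the extra state LRU needs, and all of this is an illustrative sketch rather than how any particular processor does it:

```c
#include <stdint.h>
#include <stdlib.h>

#define WAYS 4

typedef struct {
    uint64_t last_used;    /* timestamp updated on every hit: the bookkeeping LRU needs */
    /* ... valid, tag, data ... */
} line_t;

/* Random: no state at all, just pick a way. */
int victim_random(void)
{
    return rand() % WAYS;
}

/* LRU: evict the way that was touched longest ago. Keeping last_used current
 * on every single access is what makes true LRU expensive in hardware. */
int victim_lru(const line_t set[WAYS])
{
    int victim = 0;
    for (int way = 1; way < WAYS; way++)
        if (set[way].last_used < set[victim].last_used)
            victim = way;
    return victim;
}
```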
Coming back to those miss-rate numbers, we expect LRU to do better, although as we get to larger cache sizes, it turns out random gets much closer, because the odds that we're gonna evict an entry that we're then gonna go and reference are much, much smaller at the larger cache sizes. Also, as we make the cache size larger, our miss rate goes down, although there are gonna be diminishing returns. We go from 64 to 256 and the decrease is relatively small versus the decrease that we got from 16 to 64. We'll see more about this when we talk about working set sizes later on in the semester. We can also see that as we increase the associativity, we also drop the miss rate. Now these numbers are gonna vary depending on the application that you pick and its requirements for memory and its memory access patterns.

Now, what happens on a write? We have two choices. When we have a write, we can either write through, which means that when we do the write, we write the data into the cache and we also write it through to memory. The other alternative is write back, where we just store the updated information in the cache, and then we only write that modified cache block back when we have to evict it. So now we have to keep track and be able to answer the question, is this block clean or dirty? Because if we wanna replace it and it's dirty, we have to write it back to memory. So pros and cons. With write through, the advantage is that a read miss is not gonna result in a write. A read miss might cause us to evict someone, but with write through, we know every block in the cache is clean, because we're writing through; the contents of memory match the contents of the cache. The downside is that processor writes could be delayed because we need to write through to main memory, and so we're gonna need big write buffers to make that work on the processor. With write back, the advantage is that if we're writing and updating the same counter value quickly, we don't actually have to reflect that to DRAM. So we're writing at 10 nanosecond speeds instead of 100 nanosecond speeds. The processor also is not held up on writes; all the writes get absorbed into the cache. The downside, though, is this is much more complex, because we have to keep track of whether things are clean or dirty, and now eviction potentially requires writing a modified block back. So a read miss that causes you to evict something to make space in the cache could cause you to be blocked while you wait for the write back to occur. So two different trade-offs. And the choice of policy, whether you implement write back or write through, is gonna depend on what you think your workload is gonna look like.

Okay, so let's look at... yeah, question. Right, so the question is, is write back bad if our machine fails? Because you lose the writes. Well, if our machine fails, in this case, we'd lose our cache and our processor, but we'd also lose main memory. If this was writing back to the disk, absolutely, that's a risk that you have to figure out whether you're willing to take: if you're writing back to non-volatile storage like a disk or an SSD, then you do have this window of vulnerability before the data gets written back, and any power failures or other crashes that occur in that window lose it, yeah.
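Here's what the two policies boil down to for a write hit, as a rough sketch; the line layout and the memory_write_byte() helper are made up for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;                 /* only meaningful for write back */
    uint32_t tag;
    uint8_t  data[32];
} line_t;

void memory_write_byte(uint32_t paddr, uint8_t byte);   /* assumed helper: goes to DRAM */

/* Write through: update the cache and memory, so the line is always clean. */
void store_write_through(line_t *l, uint32_t paddr, uint32_t offset, uint8_t byte)
{
    l->data[offset] = byte;
    memory_write_byte(paddr, byte);
}

/* Write back: update only the cache and remember that the block is dirty.
 * On eviction, a dirty block must be written back to memory first. */
void store_write_back(line_t *l, uint32_t offset, uint8_t byte)
{
    l->data[offset] = byte;
    l->dirty = true;
}
```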
Sure, so the question is, could you be adaptive and have some interrupt-driven write back process? Absolutely. So when we look at the file system, we'll see how the buffer cache gets managed in a manner where periodically a timer interrupt goes off and you write things back to the disk to try and reduce the amount of data that's volatile in memory, yeah. Current processors usually use write back, because you get performance and you don't have to worry about faults, because it's shared fate: if the machine crashes, everything's lost, not just the cache. Any other questions?

Okay, so we've looked at caching; now we can pull everything together and apply caching to address translation. So remember this 61C picture; it's now gonna get modified. So we're gonna take our virtual addresses and we're gonna look in a translation look-aside buffer, which is a cache of translations, and if it's cached, then we have our physical address and we're done. If not, we're gonna run the whole translation process, so use our memory management unit to walk the page tables, whether that's segments on top of page tables or page tables on top of page tables, save the result in the translation look-aside buffer, and then we have our physical address again. And of course the kernel can do untranslated reads and writes, so generate physical addresses, as you've probably seen in Nachos.

Okay, so now, the first question we can ask is, is there page locality? So is it valuable to cache a virtual page to physical page number translation? And the answer is yes. So things like instruction accesses spend a lot of time on the same page, because you tend to have sequences of instructions and then a branch. Most branches also tend to be very short, maybe 50 instructions or so. So a lot of the time, instruction fetches are gonna be spending a lot of time on the same page. So for that reason alone, a TLB would be useful. The stack has a lot of locality of reference, because you're operating in the same stack frame, the local frame, repeatedly. Say you have an array or something that's stored in there; that'd be on the same page. And then, to a lesser degree, data accesses. So accesses to the heap are gonna have some locality, like if I'm walking through the characters in a string, that's gonna reference the same page. But depending on how I walk through, say, the elements of an array, my stride through that array may cross page boundaries or may remain within the same page. Now, can we have a hierarchy of TLBs? Absolutely. We could have multiple TLBs, as we'll see in a moment, of different sizes and speeds. So again, remember with two level paging, there were two extra memory references here in order to read a byte: we had to read our top level page table entry, then we had to read our second level page table entry, in order to be able to read our entry here.

So what happens when we don't find something in the TLB? There are two choices. One is to do it in hardware, the other is to do it in software. If we do it in hardware, then when a miss occurs, the hardware in the memory management unit is gonna look at the page table and walk through the page table. So it's gonna do the translation process, and if it finds that the page table entry is valid, it'll fill in the TLB entry, and the processor never knows anything happened. So it's completely transparent. It's slow, but it's completely transparent. If the page table entry is marked as invalid, then it's gonna generate a page fault.
And we'll see on Wednesday what happens, but basically the OS is gonna have to go find the page and figure out what to do. The other choice is to do it in software. As software programmers, anything they do in hardware, we can do in software, and better. So on a miss, the processor receives a fault, we trap into the kernel, and then the kernel walks through the page tables to find the page table entry. If it finds a valid page table entry, it fills in the TLB and returns from the fault. If it finds an invalid page table entry, it'll call the page fault handler internally. Now, why did I say anything we do in hardware, we can do in software better? Well, because in hardware, we can only manage our TLB very simply, like, for example, using a random replacement policy. In software, we can implement a much more sophisticated replacement policy. And since our TLB is small, the replacement policy can have a big impact on performance, as we saw from those example numbers. So architectures like the MIPS use a software traversed page table. Architectures like x86 use a hardware traversed page table and a hardware managed TLB. Many chipsets, though, give you the option of doing it in hardware. In software we can potentially do a better job, but it'll be slower.

Okay, so what happens on a context switch? We have a problem, because the TLB is mapping virtual page numbers to physical page numbers. We've changed the address space, so that mapping is no longer valid, and we have to do something. So we have a couple of choices. One choice is to invalidate the entire TLB. This is called flushing the TLB. Very simple, easy to implement: just mark every entry as zero and invalid. But if we're doing a lot of context switching, this approach is not gonna work very well, because the first thing that's gonna happen when we context switch back is we're gonna reload all those entries that we had before, because we're gonna walk all the page tables and do the translations all over again. The second choice is to actually include a process ID in the TLB. So now we need to make the TLB larger, because it's gonna have to store mappings from multiple processes, and then this gets into what our replacement policy is gonna be. This is also a solution that requires hardware support. Okay, now what happens if we move a page from memory to disk, or we delete or unallocate a page? We have to make sure we invalidate the TLB entry. And then it'll automatically get reloaded the next time we try to access that page. So if the page went out to disk, we'll walk through the page tables, discover it's out on disk, the page fault handler gets called, lots of magic happens that you'll hear about on Wednesday, and then we get a valid entry.

Now, how should we organize our TLB? It has to be really fast, because it's on the critical path for all memory accesses. And that argues that we might want to have a direct mapped or low associativity TLB. But we need the TLB to have a very low conflict rate, and both direct mapped and low associativity TLBs would have a much higher conflict rate. The perfect solution for no conflicts is going to be a fully associative TLB.
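A fully associative TLB lookup amounts to something like the following sketch. The sizes and the entry layout are illustrative assumptions, and in real hardware the loop is a bank of parallel comparators rather than a loop:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 128          /* assumed size, in the typical 128-512 range     */
#define PAGE_SHIFT  12

typedef struct {
    bool     valid;
    uint32_t vpn;                /* virtual page number: the "tag" of a TLB entry  */
    uint32_t ppn;                /* physical page number                           */
    uint32_t flags;              /* protection/dirty/used bits, maybe a process ID */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {      /* in hardware: all compared at once */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;                         /* hit: no page table walk needed    */
        }
    }
    return false;    /* miss: walk the page tables (hardware or software) and refill  */
}
```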
The problem with an indexed TLB, direct mapped or set associative, is that we have to pick some bits of the virtual page number to use as our index. So if we, say, pick the lower order bits to use as the index, then we'd have lots of collisions between the different regions of a program: the first page of code, first page of data, and first page of the stack, and then the second page of code, second page of data, and second page of the stack, would all collide. So we'd need to have at least three-way set associativity, plus a good replacement policy, to make this work without conflicts. What if we use the high order bits instead as the index? Well, the high order bits only differ in the high memory portions of a region, and for most programs, those aren't going to be used, which means most of our TLB entries will not even actually be used. So we want to avoid thrashing, and so this is going to argue that we're willing to tolerate a slightly increased cost of access, so a slightly higher hit time, in exchange for a much lower miss rate. So the best organization is going to be fully associative. And how big do we make it? Typically, it's not very big, 128 to 512 entries. So we can do fully associative. Lookup is by virtual address, and it gives us back the physical address; it gives us back the page table entry, so a bunch of other bits can come along also. Now, what happens if a fully associative TLB is too slow, because we've got 512 comparators that all have to be muxed together? Well, we can actually have a hierarchy and put a very small, like four to 16 entry, direct mapped cache in front of the TLB. So if we miss in that TLB slice, then we'll look in our fully associative TLB, and then we'll go to memory. And this can work very well, because most of a program's memory accesses are typically within a very small region, but this allows us to actually keep many, many more addresses mapped at the same time.

Now, when should a TLB lookup occur relative to a memory cache access? It could either occur before we look in the cache, which is what we've thought about so far, or we can actually do it in parallel. So now, put on your seatbelts. So, the way I've talked about it up until now is to say we do our translation from virtual address to physical address, then we take our physical address, take our byte select and our index and our tag bits, and use that to look up in the cache. So that gives us this picture. But it turns out that our offset is available immediately. We know what the offset of the physical address is gonna be, because it doesn't change: unlike with segments, with pages the offset is the same as the virtual address offset. So here's how we're gonna organize our cache: we're gonna organize it so that the offset exactly covers the byte select and the index bits of our physical address. We have our TLB, in this case it has 32 entries. We're gonna do an associative lookup in that for our physical page number. We have our cache; our cache is four kilobytes in size, each cache line is four bytes, thus there are 1,024 entries, so our index is going to be 10 bits. Our byte select is going to be two bits, to pick one of the four bytes here. So, as soon as we have a virtual address, we can immediately take that offset, which represents these 12 bits, and look in the cache. So we'll look in the cache and we'll do our byte select. So, what that will allow us to do is immediately fetch the tag, the physical page number stored in that cache line. In parallel, we're looking up in the TLB, taking our virtual page number and getting back a physical page number.
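The bit arithmetic that makes that overlap legal is worth double-checking; here it is spelled out with the numbers from the example (the enum is just a convenient way to write it down):

```c
#include <assert.h>

/* Numbers from the example: 4 KB pages, a 4 KB cache with 4-byte lines. */
enum {
    PAGE_OFFSET_BITS = 12,                        /* 4 KB pages                  */
    CACHE_SIZE       = 4096,                      /* bytes                       */
    LINE_SIZE        = 4,                         /* bytes per line              */
    NUM_LINES        = CACHE_SIZE / LINE_SIZE,    /* 1024 lines                  */
    INDEX_BITS       = 10,                        /* log2(1024)                  */
    BYTE_SELECT_BITS = 2                          /* log2(4)                     */
};

/* The overlap trick only works because the untranslated page offset covers
 * the whole cache index plus byte select, so the cache read can start before
 * the TLB produces the physical page number. */
static_assert(INDEX_BITS + BYTE_SELECT_BITS == PAGE_OFFSET_BITS,
              "page offset must cover index + byte select");
static_assert((1 << INDEX_BITS) == NUM_LINES, "index width matches line count");
```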
Now, once both of those lookups are ready — we did an indexed lookup on the right and an associative lookup on the left, in parallel — we can compare those physical page numbers. If they're equal and the entry is valid, we get a hit, and then we can return the data, because we've already done the byte select. So this allows us to do the virtual to physical address translation in parallel with the actual lookup of the byte in the cache. Kind of confusing; spend some time after class looking at it. And if you really wanna bend your mind, think about what happens if we make the cache bigger. Now this won't work, because the overlap would not be complete: we'd need another bit, so we'd need 13 bits, and our offset's only 12 bits. So if you wanna know how to make that work, that's a plug for taking 152 or 252. Question? Well, no, it does matter. So the question is, if the cache has a miss, does it matter what's in the TLB? Absolutely, it matters what's in the TLB, because we're gonna need to take that physical address and we're gonna have to go to memory and request the byte from memory, okay? And then we can load it into the cache. And we know now where to put it in the cache already, because we've already looked up the index, yes. So the cache is just producing the tag that was stored. So the tag is gonna be the physical page number, right? It's gonna be this physical page number that comes out of here. And we're then gonna compare to see, is that cache index, that cache line, storing the right physical page? If it's not, then we still have to go to main memory.

Okay, so really quickly, let me go through the last part, which is gonna pull everything that we've seen together. All right, so we take our multi-level virtual address. We take our page table pointer, which comes from a CPU register. We index with the level one page number into our top-level page table. That gives us a pointer in physical memory to our second level page table. We take our second level page number, and that gives us an index into the second level table, which gives us the physical page number. We combine that with our offset, and that gives us an address that we can look up in memory, all right? We can replace all of that with a TLB. With the TLB, now we're gonna do our lookup in the TLB, and the TLB will return our physical page number. We can add a cache to this, and now we're gonna take our physical address and break it up into a byte offset, an index, and a tag. We use the index to select a particular line. We use our tag to check whether we match. We use our byte select to select the actual byte that we want. That's it all on one slide, okay?

So in summary, the principle of locality is why all of this works. We have temporal locality and we have spatial locality in our programs. There are three plus one different types of misses: compulsory, conflict, and capacity, plus coherence misses. We looked at three different cache organizations: direct mapped, set associative, and fully associative. And the TLB is just a cache of address translations, which we can look up in parallel with our actual lookup in the memory cache, okay? See you on Wednesday.