Welcome back, everybody, to CS162. We're gonna pick up where we left off talking about virtual memory and memory mapping, and then we'll continue with paging next time. If you remember, we were looking at this general idea of address translation and the memory management unit that does it. In this scenario, virtual addresses come out of the CPU and get translated by the memory management unit into physical addresses, which represent the actual positions of the bits in the physical DRAM. So there are two views of memory: the view from the CPU, which is the virtual addresses, and the view from memory, which is the physical addresses, and those two are related by a page table, which is what the MMU supports. Last time we walked through a couple of examples of instruction execution in the processor, and we showed you when you keep to the virtual addresses and when you actually have to translate into physical ones. With translation, it's much easier to implement protection, because if two processes have their translation tables set up so they never intersect in physical memory, then it's impossible for them to interfere with each other through memory. An extra benefit, of course, is that everybody gets their own view of their personal address space, which means you can link a program once and run it multiple times on the same machine. Everybody gets their own zero is the way I like to think of that. We talked about simple paging in this context. The idea of simple paging is that there's a page table pointer, and that pointer points to memory holding a set of consecutive translations — we'll call these page table entries a little bit later. Each entry has both a physical page number and some permission bits, and there's one of these page tables per process. The virtual address mapping goes like this: you start with an offset, whose width is set by how big your page is. For a 1K page that's gonna be 10 bits; for a 4K page it'll be 12. All the rest is the virtual page number. The offset never gets changed by the page mapping, so it's copied directly into the physical address. The virtual page number is used as an index into the page table; you look that up, it gives you the physical page number, which gets copied in, and you're good to go on the physical address. So for instance, if you had 1K-byte pages — 1,024 bytes — there are 10 bits of offset, and what's left in red is 32 minus 10, or 22 bits, on a 32-bit machine. So there are up to 4 million entries in this page table, all right? Among other things, we're not necessarily gonna use them all, so we also need a page table size. There are certain indices in here — virtual page numbers — that are above the page table size, in which case you get an error. And then we also need to check our permissions. Suppose we try to do a write and we see that this particular page is marked as valid — that's V — and read, but not write. This is essentially a read-only page, so if you attempted to write to that address, you'd get an error, okay? Are there any questions on this simple paging idea? What we're talking about here is the function of that memory management unit I showed you earlier, okay?
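To pin that function down, here's a minimal sketch in C of the checks we just described — the bounds check against the page table size, the valid bit, and the write permission. The `pte_t` layout here is made up for illustration; the real MMU does all of this in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12u                 /* 4 KB pages -> 12-bit offset */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

/* Hypothetical page table entry: physical page number plus V/R/W bits. */
typedef struct {
    uint32_t ppn;                      /* physical page number */
    bool valid, read, write;
} pte_t;

/* Translate a 32-bit virtual address through a single-level table.
 * Returns false on a bounds, validity, or permission error. */
bool translate(pte_t *table, uint32_t table_size,
               uint32_t vaddr, bool is_write, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number */
    uint32_t offset = vaddr & OFFSET_MASK;  /* copied through unchanged */

    if (vpn >= table_size)       return false;  /* beyond page table size */
    pte_t *pte = &table[vpn];
    if (!pte->valid)             return false;  /* invalid entry */
    if (is_write && !pte->write) return false;  /* read-only page */

    *paddr = (pte->ppn << PAGE_SHIFT) | offset;
    return true;
}
```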
So the hardware there is gonna help us by translating these virtual addresses into physical ones. The other thing I did is I gave you a very simple example — almost a silly example, because it uses four-byte pages. But because they're four-byte pages, you know the offset's only gonna be two bits. What we can see here is that for virtual address 0x0, if we write it all out in binary, the lower two bits are zero and the upper six bits are zeros. So we take what's in red here — that's our virtual page number — and look it up in our page table, and what we see there is a four, okay? I don't have any permission bits in this example, but that four represents the translated page. So I take that four — that's really binary 100 — as the physical page number, I copy the offset 00, and that tells me that things up here in the virtual address space are gonna be down here in physical space. We looked at multiple mappings here as well: for instance, things from 4 to 8 map to page three, which is down here, and things from 8 to C map to the green up here, all going through the page table, all right? And then of course, if you look at things in the middle — for instance, address 4 here has got an E in it, 5 has got an F, 6 has got a G — where does that translate over here? Well, we can see it's gonna be right about here, but the question is how we get there. Take the fact that 6 is 0000 0110 in binary. The top six bits, 000001, say we're talking about page one — the blue one — and the offset is 10, and that takes us over to this point, 0x0E. Similarly, 9 takes us over to this point, 0x05. Okay, now there's a question here: is page zero always unmapped, so that dereferencing null pointers always causes page faults? No, not necessarily, because sometimes page zero is reserved for the operating system and can represent different things, like I/O and so on. So zero is not always unmapped. It often is, but you can't necessarily be assured of that. And that's why, in fact, null references can be very bad, because some languages like C let them happen. But if you can afford to unmap zero, then obviously you get a little bit of extra protection, because a null dereference would cause a page fault since zero wouldn't be valid in that case. So what about sharing? Actually, let me stop here for a second. Are there any pieces of this that people are worried about? I told you last time that you need to get really good at transforming between hex — which is four bits at a time — and binary. I'd memorize that quite well, because it's something that'll serve you well, all right. Now, what about sharing? Once we start having this mapping, we can do some pretty interesting things. Okay, here's a question — let me just answer it. Why are we taking the top six bits when there are only three entries in the page table? Well, because presumably this is an eight-bit machine, and therefore everything that's not an offset — remember, these are four-byte pages — is the virtual page number. So this page table potentially needs up to 64 entries in it, but because it only has three entries, the size of the page table is gonna be set at three.
Okay, and if you notice back in my previous example, I showed you this idea that you have a page table size. What that really says is that there are some virtual addresses in this scheme that are not valid — the ones for which the virtual page number is too big, all right. Good question. But we have to take six bits, because we have to take all the bits that aren't the offset. Now, here's another example. We have our virtual page number and the offset. What's interesting about this scheme is that now we can do something like this: a virtual page number with a two in it — 0x00002 here — might map to some place in physical memory, and we might have a second page table that also maps to that same place in physical memory. So now we have two processes, two separate page tables, both mapping to the same physical page, okay. This is interesting, right? It means that physical page now appears in the address space of both processes, so they can share information. If process A writes to something in page two, it'll show up in this page. If process B writes somewhere in page four, it'll show up in this page. And they can read and write each other's data, all right. Now, this is not a great mapping, okay. Why? Because I mapped the same page to different parts of the address space for these two processes. If you look, in process A I can read and write at addresses of the form 0x0002xxx, and in B I can read at addresses of the form 0x0004xxx. So the addresses are actually different, which means I can't make a linked list here and have the addresses mean something between the two processes. That's a little broken. It would in fact be better to map them to the same place. Now, there's a good question in the chat about whether you can arrange to set that up, and yes: there are virtual memory mapping system calls that allow you to map the same page to the same part of virtual memory, and thereby make sure you can do things like linked lists shared between multiple processes — see the sketch below. Notice the other thing I've shown here: process A has both read and write permission on this page, while process B does not. That might be a producer-consumer scenario, where process A is producing something and process B is consuming it. And of course, once you've got shared memory, you need to synchronize, and we're back to the synchronization we've been talking about. All right, questions. Can everybody see why I'm talking about all the addresses being of the form 0x0002xxx and 0x0004xxx? Why am I saying process A has addresses like 0x0002xxx? Yeah — these correspond to virtual page numbers two and four. And if you notice, this is hex, right? Each hex digit represents four bits. I have xxx in this instance, so that's 12 bits total — a 12-bit offset, which means a 4K page in this instance. All the remaining bits above are the ones I use for my virtual page number. So again, get comfortable with figuring out what the offset is; what's left over is the virtual page number, all right? Good. And of course, if I map the page to the same place in both of these, then the addresses would exactly match and I could make a linked list or something.
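Here's roughly what those mapping system calls look like from user level — a minimal POSIX sketch using shm_open and mmap with MAP_SHARED. The object name "/demo_page" is made up; a second process that opens and maps the same name sees the same physical page (and, for the producer-consumer setup above, the consumer would map it with PROT_READ only).

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create a named shared-memory object one page long. */
    int fd = shm_open("/demo_page", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;

    /* MAP_SHARED makes writes visible to every process mapping the
     * object. To make shared pointers (e.g. a linked list) meaningful,
     * both processes would also need to map it at the same virtual
     * address, e.g. with an agreed-upon hint address. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "hello from the producer");  /* the consumer can read this */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

(On older Linux systems, compile with `-lrt` for shm_open.)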
Okay, so what's a typical page size nowadays? That's a good question. 4K — a 12-bit offset — is very common. Some of the higher-end machines might get you to 16K, but 4K, or 12 bits, is pretty common, okay? Now, where do we use sharing? All over the place. Remember, we started this term at the very beginning saying we needed to protect address spaces from each other, so that processes were protected from each other and the kernel was protected from the processes. But we have this sharing mechanism, and I like to think of sharing as selectively punching holes in the careful boundaries we've put around processes, in a way that does exactly the kind of sharing we want. So for instance, the kernel region of every process has the same page table entries for the kernel, okay? That allows you to pop in and out of the kernel without having to change any page table mappings. The process is not allowed to access that region at user level — I'll show you this in a second — but once you go from user to kernel, say for a system call, now the kernel code can access both its own data and the user's data, okay? But if it wants to access data from other user processes, it's gonna have to do something different at that point. If you want different processes running the same binary — we talked last time, I was accused of starting a culture war, but if you wanted to run Emacs multiple times, for instance, or vi if you wanna be a West Coaster — then you can have the same binary stored in one set of physical pages, and multiple processes can map that binary, so you don't have to waste memory on duplicate code, okay? That's great. And that extends to dynamic user-level system libraries: you can make those read-only, and then everybody can share them. And the last one is the one I was just showing you: sharing memory segments between different processes, allowing you to share objects and thereby do interesting communication. Now, of course, you gotta be careful about that, because the two processes are now trusting each other to put data in that shared page that is properly formatted and can be properly interpreted by the other process, okay? So that's potentially a little less secure unless you're very careful. We can also do some simple security measures with this. For instance, we can randomize where the user code is: rather than always starting it at a particular spot in the virtual address space, we can start it in different places. That randomization — which I'll show you in another picture — helps make attacks harder when you've got things like overflow errors, which you might have heard about if you've taken 161. It also means the stack and the heap can start anywhere, again for security reasons. And then we can also use kernel address space isolation, where we don't map the whole kernel space, just part of it, and that can give us a little more security. Notice that when we talk about Meltdown, which we'll mention in a subsequent lecture, we in fact have to make sure that essentially none of the kernel space is mapped into user space. But we'll get to that a little later.
But if you look at this scheme I've got here with user space and kernel space, what this means is that because of bits set in the page table entries, when I'm in user mode I'm not able to access any of the kernel page table entries — even though they're mapped, they're not available to the user. But the moment you take a system call, suddenly both the user's memory and the kernel's memory are available to the kernel. And that makes it much cleaner and simpler to do a system call. So here's a typical layout — I actually showed you this last time — and we see a bunch of holes in here. These holes are what allow us to do randomization, thereby making it harder to put executable code on the stack, among other things, and harder to attack. So that's a good security measure, okay. But all of these holes are things we need to support, and unfortunately, so far our page tables don't have a good way to support holes: in order to go from 0 up to 0xFFFFFFFF, we'd need the page table 100% filled, and a lot of these empty spots are just gonna be null entries that say invalid, and that's a waste. So we need to do something different, and that's our next topic, okay. Questions. Okay, so right now the page table I've shown you — this is answering a question in the chat — doesn't actually let you spread everything around without wasting a bunch of entries that are null. Right now, in order to map this virtual space, I'd have to have all of my entries, but a bunch of them are gonna be empty, and that's a waste, so we're gonna fix that, okay. And you're right, we can map anywhere in physical space we want, but the virtual side of this is still wasteful. So just to summarize, here's an example: here's our virtual memory, we've got all these holes, and that means the page table has to be 100% full, okay. Those advantages we might have gotten by setting the length of the page table to less than the full size — we lose them, because we need things like the stack at the top of the address space and the code near the bottom, so we have to map the whole range, and that's a waste. The other thing I wanted to show you is that this virtual memory view goes through the page table and maps to data that's potentially spread all over physical memory — I'm even showing you some gray things here which represent other processes. That scrambling of physical memory is a big advantage of page tables, because every one of these pages is exactly the same size, so we can allocate or deallocate them any way we want, which makes memory much easier to manage. The other interesting thing here is the typical stack-grows-down, heap-grows-up layout. If you notice, in this case the stack currently has only two pages actually mapped; the entry right underneath the stack, if you look over here in the page table, currently has a null entry in it, okay. That null entry means there's nothing mapped there. So the moment we try to go below that stack — suppose we're just pushing things on the stack and we hit this point — we're gonna cause a page fault, which we'll talk a lot about next time, and at that point we can actually add some more memory, okay.
So if the stack grows, we just add some more stack, and now all of a sudden we've got more stack. What's great about this is we can start with the smallest amount of physical stack we can, and grow it as the process needs it, so we don't have to commit physical resources up front — page faulting lets us grow the stack dynamically as needed. The page table base register, by the way, is gonna be CR3 in the x86 processor, and I'll show you that in a moment, okay. The challenge — just to summarize what I've been saying — is that the table size equals the number of pages in virtual memory. If you count up the number of potential pages, even the empty ones, that's the number of entries in our page table. And that's the thing I'm saying is really unfortunate about what I've told you so far. So clearly what I've told you isn't the full story, okay. And that's our next topic. So how big do things get? All right, let's talk about size. By the way, I'm gonna go through these just to make sure everybody's on the same page with their powers of two, okay. You are now über OS students, so you need to know these things, and know them well. A 32-bit address space is 2^32 bytes, or four gigabytes, okay. And notice that I've got a capital G and a capital B: a lowercase b means a bit, a capital B means a byte, which is eight bits. For memory, kilo is not 1,000 — it's 2^10, which is 1,024. That's almost 1,000, but not quite. Now, in 61C or somewhere they might've called this a kibi, and kibis are great — although they always sound like cat food to me. M for mega is 2^20, which is almost a million but not quite — a little more, actually — sometimes called a mebi. G for giga is 2^30, not quite a billion, sometimes called a gibi. The thing you need to know is that when you're dealing with memory, you should mentally translate K, M, and G into powers of two rather than powers of 10, because people write KiB, MiB, and GiB far less often than you might like — yes, the ones with the "i" in them. The other thing that's a little confusing, by the way, is that when you start dealing with things like network bandwidth and you say kilobytes per second, that is a power of 10. So unfortunately this terminology is very confusing, and I just want you to be aware of the confusion, because you're gonna run into it as you go. So the typical page size, as I said, is four kilobytes, which is how many bits? Well, if 2^10 is 1,024, then four kilobytes is an extra two bits, because 2^2 is four — so that's 12 bits. Okay. How big is a sample page table for each process? Well, let's look at this. If a page itself is 2^12 bytes in size and there are 2^32 bytes total, I just divide the two, which means I subtract the exponents, and I get 2^20 — about a million entries. And they're gonna be four bytes each — I'll show you what the entries look like in a moment. So that's about four megabytes spent on a page table where a lot of the entries are empty. So we're gonna need to do something different.
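Here's that arithmetic spelled out, just to make the powers-of-two bookkeeping concrete (the 4-byte entry size is the one assumed above):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va_space  = 1ull << 32;            /* 2^32 B = 4 GB address space */
    uint64_t page_size = 1ull << 12;            /* 2^12 B = 4 KB pages */
    uint64_t entries   = va_space / page_size;  /* 2^(32-12) = 2^20 entries */
    uint64_t table_b   = entries * 4;           /* 4-byte PTEs */

    printf("%llu entries, %llu MB per page table\n",
           (unsigned long long)entries,
           (unsigned long long)(table_b >> 20)); /* 1048576 entries, 4 MB */
    return 0;
}
```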
When 32-bit machines first got started — things like the VAX-11/780, the Intel 80386, et cetera — 16 megabytes was a lot of memory, okay? So four megabytes of page table was a quarter of all memory. This is clearly not something we wanna do. And just to hammer this home: how big is a page table on a 64-bit processor? Well, 2^64 over 2^12 is 2^52, which is about 4.5 × 10^15 entries — 4.5 quadrillion. They'd be eight bytes each, which is about 36 petabytes for a single page table. All right, that's clearly absurd. So this page table I showed you — I'm calling it a simple page table — is clearly not what we want. It's just a lot of wasted space, all right? Questions? So the address space is fundamentally sparse — remember all those holes I showed you? — and we want a layout of our page tables that handles holes well. And really, what is a page table? Let's think about this. What do you need to switch on a context switch? Well, you just need to switch that one pointer at the top, so switching the address space is easy in some sense. Now, what's not so easy is that oftentimes you have to flush a bunch of TLB entries and so on, so switching the address space can be more expensive than just switching the pointer. Now, what provides the protection here? Translation per process plus dual-mode execution. What that means is that only the operating system is able to (a) install the page table pointer, and (b) change the page tables — because we can't let a process alter its own page table. Now, there's a question about whether the process's page table is stored with its PCB. Typically it's in a different part of memory — it's kind of on the kernel's heap in some sense, because a PCB has pointers to everything but doesn't necessarily contain big things like page tables. That's a good question, though. It could in principle; it often doesn't. Some analysis here. The pros of the page table scheme we've come up with so far: it's very simple memory allocation, because every page is the same size, and it's easy to do sharing. The cons: if the address space is sparse, which it is, you start wasting a bunch of entries. And if the table's really big, the problem is you're not running every process all the time, so you're wasting a huge amount of memory — it'd be really nice if we could keep just an actual working set of our page table. You can see we're gonna stray into caching very quickly here. So the simple page table is just way too big, we don't wanna have to hold it all in memory, et cetera. Is there something else we can do? Maybe we could make our table have multiple levels. That's where we're going. Now, does the page table also specify whether something's accessible to the user or the kernel? Yes — there's a bit in the page table entry; I'll show you that in just a second. So how do we structure the page table? Well, a page table is just a map — a function — from virtual page number to physical page number, all right? Virtual address in, physical address out. And there's nothing that says this has to be a single table; if it is, it's ridiculously large, as we just showed. What else could we do? We could build a tree, or we could build hash tables, okay? If you can think of it, we could come up with it. And one fix for the sparse address space is the two-level page table idea.
And I wanna show you what I like to call the magic page table. This is a fun one — you'll see why it's magic in a moment. This is for 32-bit addresses, and it's a tree of page tables where we have 4K pages. 4K pages means 12 bits of offset, and we have four-byte page table entries, okay? So each entry is four bytes total, and I'll show you what's in those four bytes in a moment. What that means is we can take the virtual address as a 12-bit offset plus two 10-bit indices. The first 10 bits index the first-level page table, selecting one of 1,024 entries — and there will be 1,024, because 4K bytes divided by four bytes is 1,024. The second 10 bits pick the entry in the second-level table, and that gives us the physical page number; and of course we copy the offset across, okay? So the tables in here are all fixed size — in particular, they're all 4K bytes. These second-level page tables are four kilobytes, the first-level one is four kilobytes, and the pages themselves are four kilobytes. What's cool about everything being four kilobytes is that now we can start talking about swapping — paging out — parts of the page table itself to disk. So only the parts of the page table we're actively using even have to be in memory, okay? Of course, the top-level table always has to be there if the process can run, but there's a whole bunch of lower-level ones that don't have to be, okay? And since the tables are fixed size, on a context switch we just have to save the single page table pointer — in the PCB, for instance. The page tables themselves aren't necessarily stored in the PCB, but that page table pointer is the address space descriptor, and just switching it out — possibly with flushing TLBs, we'll get to that later — is enough to change the whole address space of the machine and go from one process to the next, okay? Now, the valid bits on the page table entries: I sort of indicated we could swap out the pages, but what did I mean by that? Well, look at this situation where we take the first 10 bits and look them up in the first-level page table. If we wanted this second-level page table to be out on disk, we could mark the first-level entry as invalid. Then what would happen on a lookup of this virtual address? Those 10 bits would index the first-level entry, we'd see it's invalid, and we'd cause a page fault. That page fault would get resolved by the operating system bringing the second-level page table in; we'd retry, and now the first 10 bits would work. Then the second 10 bits would get tried, and maybe that entry would be marked invalid too, in which case we pull the actual page in from disk, and then we're finally able to do the reference. Now, that sounds really expensive — remember, a disk access is a million instructions' worth — but because of a caching view of the world, we only do this once, and the many times we access it afterwards, everything's fast, okay? All right, questions. Now — good question — is the structure of the page table built into the hardware? Yes, all right. Typical machines these days, like the x86 with the memory management unit I showed you, have a particular page table structure built into them, and it's thereby the same for pretty much all processes running on a given machine at a given time.
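In the 10-10-12 scheme, the walk the hardware does looks roughly like this. This is just a sketch: `phys_to_virt()` is a made-up stand-in for however the walker reaches physical memory, and the entry format is simplified to a present bit plus a 20-bit frame address.

```c
#include <stdint.h>

#define ENTRY_PRESENT 0x1u           /* low bit: valid/present */
#define FRAME_MASK    0xFFFFF000u    /* top 20 bits: physical frame */

/* Hypothetical helper: map a physical address to something we can read. */
extern uint32_t *phys_to_virt(uint32_t paddr);

/* Two-level page table walk; returns -1 to signal a page fault. */
int walk_two_level(uint32_t pagedir_base, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t dir_idx = (vaddr >> 22) & 0x3FF;  /* top 10 bits */
    uint32_t tbl_idx = (vaddr >> 12) & 0x3FF;  /* next 10 bits */
    uint32_t offset  =  vaddr        & 0xFFF;  /* 12-bit offset */

    uint32_t pde = phys_to_virt(pagedir_base)[dir_idx];
    if (!(pde & ENTRY_PRESENT)) return -1;  /* fault: 2nd level absent */

    uint32_t pte = phys_to_virt(pde & FRAME_MASK)[tbl_idx];
    if (!(pte & ENTRY_PRESENT)) return -1;  /* fault: page absent */

    *paddr = (pte & FRAME_MASK) | offset;   /* offset copied through */
    return 0;
}
```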
Now, some machines — the MIPS processor line and things related to it — actually do something a little different: they don't have hardware that walks its way through the page table, doing what we call a page table walk. They do it in software. When you try to access something that's not in the TLB — which we'll get to shortly — you trap to software, and then the software can structure the page table pretty much any way it wants. But the page tables you're dealing with now, with Pintos on the x86, use a hardware page table walk, so the structure of the page table is absolutely built into the hardware, all right. Now, here is the classic 32-bit mode of the x86. I just wanted to show you this. In Intel terminology, rather than saying there are two levels of page table, they call the top level a page directory — but that's just terminology. Essentially, you have CR3, the register accessible only to the kernel, that defines the top-level page table: it points at the page directory. We take 10 bits of the address and index into that page directory, and the entry gives us a 20-bit pointer to the next page table in physical memory, okay. The next 10 bits index into that page table, and its entry gives us 20 bits representing the physical page; we combine that with the offset, and that's the actual final address we're looking at, okay. Now, I just threw something at you very quickly there, so let's see if we can understand it. If you look at how the addresses — even the physical ones — are structured, there are 12 bits of offset and 20 bits of either virtual or physical page number. That means when I specify a physical page, I have to give 20 bits of unique address to specify it; the remaining 12 bits of offset can be anything we want. The physical page is defined by 20 bits, which means each page table entry has 20 bits of physical address in it, okay. All right. So, some administrivia: midterm two is coming up Thursday, 10/29. Topics are gonna be up through lecture 17, so we have some good topics for the midterm: scheduling, deadlock, address translation, virtual memory, caching, TLBs, demand paging, and maybe a little bit of I/O. The first midterm was somewhat of a dry run; this next one will actually require you to have your Zoom up and working — when the proctoring TA pops into your Zoom room, you need to have things going. So just be aware that you should make sure you get your setup going. Things are gonna be almost the same as last time, which worked reasonably well, except that I think we're gonna pre-generate all your Zoom rooms for you and you're just gonna connect to them — but watch for that. And anyway, we wanna make sure your setup is debugged and ready, okay. There will be a review session; we don't have any details on that yet, but we'll get the Zoom details out. And the most important administrivia I wanted to mention: the US election is coming up. For those of you who are citizens or have the ability to vote, absolutely vote. This is the most important thing you can do as a US citizen, and you need to do it. And if you don't do it, you don't get to complain about the results — so I would say, don't miss the opportunity.
And of course be safe: if you can vote by mail, do that; otherwise wear a mask and socially distance, but be careful, okay. I would say, without being political, that this is potentially the most important election in a century, so don't miss it, all right. And those of you in California, don't go to any of the fake ballot boxes — there actually are a bunch of them. Go to the post office, and what's even better, you can sign up to find out the status of your ballot. There's an online tool for that. I did it, it's awesome: I got a text the moment the post office found my ballot and scanned it, and then when it got to the destination, it said it will definitely be counted — you get a text every time something happens. So beware the fake ballot boxes. Thank you for that, Ashley. All right. Good, so what is this page table entry of which I speak, okay? It's the entry in each of the page tables, and it's potentially a pointer to the next-level page table, or it's the translation for the actual page itself. It's got permission bits: valid, read-only, read-write, write-only, okay. I'm gonna give you the x86 architecture as an example. The address format is the same as the previous slide — this is the magic 10-10-12 layout, and remember, intermediate page tables are called directories on x86. Here's what an entry looks like, okay. It's 32 bits, four bytes. And if you notice, there are 20 bits of physical page number, because remember, I said you need 20 bits to uniquely identify a 4K page — and then the remaining 12 bits are interesting, okay. The lowest bit here is the present bit. Most everybody except Intel calls it the valid bit; Intel likes to name things differently than everybody else, so they call it the present bit, but it's the same idea. If it's a one, this page table entry is valid and you can go ahead and do the translation. If it's a zero, the entry is invalid, and all 31 of the remaining bits are essentially free for the software to use — which can be an interesting way to keep information about where a page really is when it's not valid and mapped in memory. So that's the present bit. The writable bit, W, says whether this page is writable. The U bit says whether this is a user or kernel page: if it's zero it's a kernel page, if it's one it's a user page — I believe, I may have that reversed; look it up in the spec. Then we have some things about caching: PWT and PCD control whether caching is allowed. Page write-through, PWT, means writes go straight through to external memory, and PCD means caching is disabled for the page. These two are important when we start talking about memory-mapped I/O, which we'll get to in a few lectures. A says whether this page has been accessed recently; it gets set by hardware and reset by software. D says whether the page is dirty; it gets set by hardware whenever you write to the page, and reset by software. And then PS gives you a page size: if you set it to zero, it's exactly the 10-10-12 scheme I showed you; if you set it to one, there's only one level of page table and you get four-megabyte pages out of it, which you might use for the kernel. Okay, questions.
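Written out as C constants, the 32-bit x86 entry bits just described look something like this (the `make_pte` helper is just for illustration):

```c
#include <stdint.h>

/* x86 32-bit page table entry bits. */
#define PTE_P    (1u << 0)  /* present ("valid" everywhere but Intel) */
#define PTE_W    (1u << 1)  /* writable */
#define PTE_U    (1u << 2)  /* user-accessible (0 = kernel-only) */
#define PTE_PWT  (1u << 3)  /* page write-through */
#define PTE_PCD  (1u << 4)  /* page cache disable */
#define PTE_A    (1u << 5)  /* accessed: set by hardware, cleared by OS */
#define PTE_D    (1u << 6)  /* dirty: set by hardware on any write */
#define PTE_PS   (1u << 7)  /* page size (in a directory entry: 4 MB page) */

#define PTE_ADDR(pte) ((pte) & 0xFFFFF000u)  /* 20-bit physical page number */

/* Example: a present, writable, user-accessible mapping of frame `ppn`. */
static inline uint32_t make_pte(uint32_t ppn)
{
    return (ppn << 12) | PTE_P | PTE_W | PTE_U;
}
```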
Now, what can you use this for? We'll talk more about it next time and the time after, but an invalid page table entry — one where the P bit is zero — can imply all sorts of things. One is that the region of the address space is actually invalid: there may be a hole in the address space that is never gonna get filled. Okay, and in that case, a page fault will occur, and potentially the process will be faulted. The other option is that it's not valid right now, but the page is somewhere else — so we potentially go out to disk to pull it in. That means that after the page fault happens, the kernel will reset the page table entry so the valid bit is now one, and then you retry the load or store, and at that point it goes through. Okay, the validity check always happens first, so the remaining 31 bits can be used by the operating system for location information — like where the page is on disk — whenever the entry is invalid. So a good example is demand paging, okay — this is the simplest thing you hear about with paging right off the bat. Demand paging means we only keep the active pages in memory; the rest are kept out on disk and their page table entries are marked invalid. So now, rather than having to swap a whole process out like we talked about a couple of lectures ago — sending all of its segments to disk — we can send out just the pages that aren't being used, and we get much more efficient use of memory that way. Another interesting one is copy-on-write. So we've talked about Unix fork multiple times, and the interesting thing about fork, if you remember, is that when we fork a new process, both the parent and the child have a copy of the full address space. And we said that rather than being so expensive as to copy everything, what you do instead is copy just the page tables and mark everything read-only. The moment either the parent or the child tries to write, they get a page fault, and at that point we copy the page and make two copies — that's called copy-on-write. Okay, another is zero-fill on demand. You can say: all of these pages are gonna be zero, because we wanna make sure we don't accidentally reveal information from the previous process to use that physical page. What we do there is mark the page as invalid, and the moment you try to access it you get a page fault, and the kernel zeroes a physical page for you, maps it, and gives it back to you — and that's zero-fill on demand, okay. So for those who have taken interesting language classes in CS, we're essentially doing late binding on our zero-fills and our copies. Okay.
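Here's roughly what the copy-on-write path of a page-fault handler looks like. This is a sketch, not Pintos or Linux code: the helpers (alloc_frame, frame_refcount, frame_to_kva, frame_unref, pte_of, tlb_flush_one) and the choice of bit 9 as the "COW" marker are assumptions, though bits 9–11 really are software-available bits on x86.

```c
#include <stdint.h>
#include <string.h>

#define PTE_P   0x001u
#define PTE_W   0x002u
#define PTE_U   0x004u
#define PTE_COW 0x200u  /* bit 9: one of the OS-available bits */

/* Hypothetical kernel helpers. */
extern uint32_t  alloc_frame(void);              /* phys addr of a free frame */
extern int       frame_refcount(uint32_t frame);
extern void     *frame_to_kva(uint32_t frame);   /* kernel-visible pointer */
extern void      frame_unref(uint32_t frame);
extern uint32_t *pte_of(uint32_t *pgdir, uint32_t vaddr);
extern void      tlb_flush_one(uint32_t vaddr);

/* Called on a write fault to a present, read-only, COW-marked page. */
void handle_cow_fault(uint32_t *pgdir, uint32_t vaddr)
{
    uint32_t *pte = pte_of(pgdir, vaddr);
    uint32_t  old = *pte & 0xFFFFF000u;

    if (frame_refcount(old) == 1) {
        /* Last sharer: no copy needed, just make it writable again. */
        *pte = (*pte | PTE_W) & ~PTE_COW;
    } else {
        /* Still shared: copy the 4 KB frame, remap, drop our reference. */
        uint32_t copy = alloc_frame();
        memcpy(frame_to_kva(copy), frame_to_kva(old), 4096);
        *pte = copy | PTE_P | PTE_W | PTE_U;
        frame_unref(old);
    }
    tlb_flush_one(vaddr);  /* the stale read-only translation may be cached */
}
```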
So here's another example of sharing that's kind of interesting. I'm showing you two processes, with page table pointer and page table pointer prime. The important thing to see here is the green part: notice we're saying that a couple of whole sub-pieces of the address space are shared, okay. Now, there's a question: does zero-filling pages mean the information doesn't actually get deleted from physical memory? Well, at the point the kernel hands the page over, it does overwrite it, okay. It makes sure everything is fully written, and you don't have to worry about what happens before that overwriting, because you can't even read it — it's marked invalid, so the moment you try to read it, you get a page fault, and then the kernel fills it with zeros and gives it back to you. So no worries about your secret keys at that point. Okay, so we can share whole sub-pieces, all right. And you can imagine that a whole big chunk might represent the user's space, and you could have a user page table and a kernel page table that just has user-plus-kernel entries in it, where they mostly share the whole page table — this is gonna be useful when we talk about the Meltdown problem, okay. But we'll talk about that later. All right, so two-level paging, very simple, okay. Just like before, here we have an address, and in this toy example the first three bits look up the first-level page table, the next three bits look up the second-level page table, and then you get the final physical mapping. The point of this particular slide is to show virtual memory positions mapping all the way to physical memory, just to give a better idea of how the multi-level mapping goes, okay. And notice that we do copy the offset — the 0, 0, 0 gets copied across, okay. So in the best case, the total size of the page table is approximately the number of pages actually used by the virtual memory. There's not as much wasted space as with a single-level page table, because if we have big chunks of unmapped space, we just put a null in the top-level page table and we don't even have to have the second-level page table. So we save a whole bunch of page table space when the address space is sparse, okay. Now, we can take this further — it's like a meme, right: we can make multi-level pretty much anything we want, and if you can think of it, it's been done. So what about a tree of tables? The lowest level might still be fixed-size pages, managed with a bitmap like we talked about, while the higher level might be segmented, and you could have many levels. So here's an example where I take the virtual address, split off a segment ID at the top, then a page number, and then an offset, okay. And I copy the offset — we always do. Now the virtual segment number goes to a segment table, and that gives me a base, which is the address in memory of a page table; then I use the virtual page number to look up the page table entry, and that gives me a physical page number. And of course, for all the sparseness reasons I talked about, what you're really gonna do is have a segment number and then two levels of page table, okay. And then we check for access errors — is it valid, is it writable or not — so there are various places I can get errors. What do you have to save and restore on a context switch here? Remember, for the simple page table we just had to save and restore the base. In a segment situation, as I said a few lectures ago, the segment registers are typically stored on the processor, and in that case you've gotta save and restore the segment registers during a context switch. So this is a little bit more expensive. Now you might say, wait a minute, why are these segment registers not stored in memory? Simply because there's such a small number of them, they're typically just kept in the processor, okay — because that's much faster than going to memory, and you only pay the cost when you do a context switch, okay. What about sharing complete segments? Well, this is par for the course by now: you can have the virtual segment number of process A and the virtual segment number of process B both point at the same chunk of page table.
And now they're essentially sharing all of the pages in that page table between the two processes, okay. So the cool thing about the flexibility you get out of these mapping schemes is that you can do whatever sharing is appropriate — the key being that when you punch these holes in the protection afforded by processes, you punch them carefully, so you're only sharing when you mean to, rather than sharing without knowing you're doing so. Okay. So the pros of multi-level translation: you only need to allocate approximately as many page table entries as the application needs — and that's what gives you sparse address spaces. It's easy memory allocation. Why? Because the pages are all the same size, okay, so it's really easy to put them on a free list — in fact, you don't even have to put them on a list; you just need a very large bitmap. And it's easy sharing; I just showed you many ways of sharing, okay. The cons: there's a pointer per page, with pages typically 4K or 16K today, and if you've got a 64-bit address space — I'm gonna show you this in a moment — the page tables still add up, even with multiple levels, okay. Also, the page tables need to be contiguous — each of the sub-pieces has to be contiguous. But that's okay, because we're allocating them 4K at a time, right. And in the 10-10-12 configuration, the page tables have been set up to be exactly one page in size, so the same allocator can be used for both the page tables and the pages themselves. Now, the other con, which we haven't addressed yet — I've slipped something in here without you guys realizing it. This looking up of multiple levels of translation takes time, okay. This is not magic. It's hardware, and every level requires cycles to go to DRAM and look things up, okay. So how are we gonna possibly deal with that? Because it seems like I've taken something that was fast — loads and stores to cache — and turned it into something slow. What am I gonna do? Anybody have any ideas? Caching, exactly. A TLB is a type of cache, exactly. We're gonna use caches, all right. And for the person who asked about virtual caches: the TLB will be indexed by virtual addresses, but the actual data caches are gonna be physical. I'll try to mention that when we get there, all right. Now, remember dual-mode operation — I just wanna toss this out again. Can a process modify its own translation tables? No, because if it could, all of this protection would be gone. Only the kernel should be able to modify (a) the tables themselves and (b) which tables are in use — setting CR3 can only be done in the kernel. And to assist with protection, the hardware gives you dual mode, right? We talked about kernel mode versus user mode, and even though x86 actually has four modes, we're really only using two, and there are bits in control registers that get set as you go from user mode to kernel mode and back, okay. Just remember this, all right. And in x86 there are actually rings: ring zero is kernel mode, ring three is user mode, and the ones in the middle are sometimes used when you're using virtual machines, okay. There's also some additional support for hypervisors, which we'll talk about in a later lecture, that people sometimes call ring minus one or something like that.
All right, so to summarize all of this, now that we know more about memory mapping: certain operations are restricted to kernel mode. Modifying the page table base register can only be done in kernel mode, and the page tables themselves can only be modified in kernel mode, okay. Now, there's a question here about whether we could use virtual caches and avoid some of the slow translation problems. The answer is we could, except that virtual caches have all sorts of consistency problems. The simple way to see it is that since every process has its own notion of zero, with virtual caches, the moment you switch from one process to another, you've gotta flush the cache — because the first process's notion of zero is different from the second process's. So virtual caches are not used very much these days, because of that complicated mess of having to flush them when you switch from one process to another, okay. Now, let's make this real for a second. Here's the x86 memory model with segmentation. We have a segment selector, all right, and typically you get that selector out of the instruction — this is, for instance, the GS segment, okay. That selector gets looked up in a table, and the result is combined with an offset to give us a linear address, 32 bits. Now we take that linear address — this is the virtual address space as set up by the segments — and we look it up through the page directory and then the page table. So we actually have a segment lookup followed by two page lookups, okay. And yes, all of that lookup is expensive, and computer architects hate expensive operations because they slow everything down. Okay, now I just wanted to show you, very briefly, a little more about what's in a segment on x86. There are six segment registers: SS, CS, DS, ES, FS, and GS. They look like this — I was making them green earlier; I probably should have made this one mint green. A segment register has 16 bits: 13 of them are an index, then there's a global/local table bit, and then the privilege level you're operating at. So what's in that segment register? A pointer to the actual segment: what's in the processor points at what's in memory. And if you look here, what's in memory is a big table — two different tables, actually, the global one and the local one. Depending on which table bit you've got, you look the index up, and that gives you a segment descriptor, and in that descriptor is a set of bits that tell you where the segment starts in memory, how long it is, and what its various protection bits are. Okay, and if you're wondering why this layout is so messy, by the way: you take the pieces of the same color, put them together, and that gives you the actual offsets and position of the segment.
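In code, pulling a selector apart is just bit fiddling. The field layout below is the actual x86 one — 13-bit index, table-indicator bit, two-bit RPL — and the struct is just for illustration:

```c
#include <stdint.h>

/* Decode a 16-bit x86 segment selector. */
struct selector {
    uint16_t rpl;    /* bits 1:0  - requested privilege level (0..3)  */
    uint16_t ti;     /* bit  2    - table indicator: 0 = GDT, 1 = LDT */
    uint16_t index;  /* bits 15:3 - which descriptor in that table    */
};

static struct selector decode_selector(uint16_t sel)
{
    struct selector s = {
        .rpl   =  sel       & 0x3u,
        .ti    = (sel >> 2) & 0x1u,
        .index =  sel >> 3,
    };
    return s;
}
```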
You know, anybody want to guess why this is so messy? "Not easier in hardware"? No — it's messy, but it's not complicated in hardware, and that's not the reason. Yeah: because it's really messy, you have to deal with it in software. Nobody in their right mind would make something like this unless there was a reason, and the reason is backward compatibility with the original x86 processors, even as they expanded to 32 and 64 bits. So: messy. But notice that the original six segments have this RPL field, which for the code segment tells you your current privilege level, zero or three. Okay, all right. And the difference between CPL and RPL has to do with the privilege level of the actual descriptor itself versus what the segment register says. Okay. Now, how are segments used? Well, there's one set of global segments for everybody, and another set of local ones that are per-process. In legacy applications, the 16-bit mode is utilized and the segments are real, okay: they actually have a base and a length and they do something helpful — and originally that wasn't paged at all. Once we get to 32- and 64-bit mode, what happens is what I showed you earlier: the segments are used to figure out the linear address, and then that just goes through a normal paging scheme. And in modern operating systems — this question was on Piazza as well — if you notice, at least the first four segments are all set up so the base is zero and the length is four gig, which effectively makes them not do anything useful. The reason is that operating systems just don't bother with segments, and they call that a flat address space, okay. You have to keep the segments there because the hardware needs them, but essentially they're set up in a way that doesn't do anything, all right. The one exception is that the GS and FS segments are typically used for thread-local storage, so every thread can have a little chunk of memory that's unique to it. You can do things like move GS:0 into EAX — that's actually fetching the zeroth entry of the thread-local storage for that thread. That was originally supported by some of the GNU tools like GCC, and it's certainly been part of modern operating systems for a long time. The other interesting thing is that when you get to 64-bit mode, the hardware doesn't really support segments anymore: even though they're still in the instructions, the first four segments have a zero base and no length limit and are unchangeable. So that flat mode has been baked into the 64-bit hardware, and the only segments with remaining functionality are FS and GS, because of thread-local storage. Okay, so you could almost say segments are essentially unused in modern x86 operating systems, except for thread-local storage. It would arguably be faster for the hardware not to support segments at all if they're not used, but several x86 modes still have them, so the hardware needs to support them, okay. And if you were starting from scratch — say, building a RISC processor like RISC-V, which you guys are aware of — you might not put segments in at all.
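Since FS and GS survive precisely for thread-local storage, here's what that looks like from C — GCC's `__thread` storage class. On x86 the compiled code typically reaches the variable through the FS or GS base, much like the "move GS:0 into EAX" example above:

```c
#include <pthread.h>
#include <stdio.h>

static __thread int tls_counter = 0;  /* one independent copy per thread */

static void *worker(void *arg)
{
    (void)arg;
    tls_counter++;  /* touches only this thread's copy */
    printf("thread %lu: counter = %d\n",
           (unsigned long)pthread_self(), tls_counter);
    return NULL;
}

int main(void)  /* compile with -pthread */
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;  /* each thread prints 1: the copies are independent */
}
```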
Now, what about a four-level page table? Well, here's a typical x86-64: there are actually four nine-bit indices. The physical page number is long enough that the entries have to be eight bytes each, okay — and that's why we have nine bits of index here instead of ten. So to look up from a virtual to a physical address, you actually have to look up four things, okay. We're starting to get pretty expensive. And then when we get to virtual machines, which we'll talk about later, you potentially double all of this and it gets even more expensive. So for the x86-64 architecture, here we go: CR3 and then four references to get to the actual page, okay. And interestingly enough, you can have even larger pages. So if you look here — let me back this up for a sec. We take CR3, and that gives us first level, second level, third level, fourth level. If you look in that third-level page table entry, there's a bit that, if we set it equal to one, means there's no fourth level — and therefore we actually get a two-megabyte page out of this, because the offset is 21 bits rather than 12. And we can go even further: if we set PS equal to one in the second-level page table, we can get gigabyte pages, all right. So that's a mode supported by x86-64, and these larger page sizes kind of make sense with memory so cheap these days. But the trick is that if you allocate really large pages and they're not fully used, you've got internal fragmentation waste again. So these larger page sizes are typically used for things that are fixed, always present, and unlikely to be paged — the kernel is a good example — or maybe, if you're building a special operating system for something that streams really large items, you might use some of these bigger pages. But they're certainly available. What happens to the higher bits? That's a really good question, and I'll show you more in a later lecture, but the simple answer is: if you look at the virtual address here, the higher bits must all be the same — either all zeros or all ones — and everything in between, where they're not all the same, faults. What that looks like in the virtual address space is a chunk at the top, a chunk at the bottom, and a really big hole in the middle — a permanent hole you can't map anything into. And the reason they do it that way is that typically the things at the top are kernel and the things at the bottom are user. Okay, and as the hardware expands, you can add more and more bits as you go.
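That "all the high bits must match" rule is easy to express in code. Here's a sketch assuming 48 implemented bits (the usual x86-64 case) and the usual two's-complement arithmetic right shift:

```c
#include <stdbool.h>
#include <stdint.h>

/* A 48-bit virtual address is "canonical" only if bits 63:47 are all
 * copies of bit 47: all zeros (the user chunk at the bottom) or all
 * ones (the kernel chunk at the top). Everything else is the hole. */
static bool is_canonical48(uint64_t vaddr)
{
    /* Sign-extend from bit 47 and see whether that changes the value. */
    int64_t extended = (int64_t)(vaddr << 16) >> 16;
    return (uint64_t)extended == vaddr;
}

/* is_canonical48(0x00007FFFFFFFFFFF) -> true  (top of the user half)
 * is_canonical48(0xFFFF800000000000) -> true  (bottom of the kernel half)
 * is_canonical48(0x0000800000000000) -> false (inside the hole)        */
```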
Now, there's a question: did IA-64 have a six-level page table? No — too many bits. IA-64 was basically an Intel architecture designed for really huge machines, and they were gonna map all 64 bits, but they didn't wanna do it with a forward-mapped tree, because there'd be way too much to look up. So the question is: what else could we do if we're trying to build a table that's mostly sparse? Well, we could build a hash table. Okay, so all of the previous schemes are called forward page tables, because you take the virtual address, peel bits off, look them up in the first level, then peel off some bits for the second level, third level, fourth level. Instead, we can do an inverted page table, which looks kind of like this: you take the virtual page number and look it up in a hash table, and that gives you the physical page. Okay, so the advantage of this is that the hash table's size is related — I'm gonna say of order — to the number of physical pages you have in DRAM, whereas in the forward scheme, the size of the page table is related to the number of bits you have in the virtual address. Okay, think that through: even though we've done a good job of allowing things to be sparse with a forward page table, its size is of order the size of the virtual address space, not the amount of DRAM you have — whereas here, it's of order the amount of DRAM you've got. Okay, and that's why inverted page tables have shown up in a few architectures over the years, actually supported in hardware: the PowerPC, the UltraSPARC, and the IA-64 — the one I just mentioned — all had inverted page tables supported in hardware. There's more complexity to it, so the hardware is a little more complicated, and the page table accesses themselves don't have any locality, because it's a hash table — so while with the previous schemes the page table entries can be cached in the regular cache, that's much harder with an inverted page table, okay? And what makes it inverted is really that this hash table has one entry per physical page, whereas the forward scheme has kind of an entry per virtual page — that's why it's inverted. Whether it's faster or slower depends a lot on the architecture: it's certainly potentially a lot faster than looking up four levels of nine-bit indices, but it's certainly not simpler, so it's a question of hardware simplicity. So the total size of the page table here is roughly the number of pages the program is using in physical memory, rather than the number of pages in the virtual memory.
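Here's a toy sketch of the idea — one entry per physical frame, found by hashing the (process ID, virtual page number) pair. The table size, hash constants, and linear-probe collision handling are all made up for illustration; real designs like the PowerPC's use hash-anchor tables with chaining.

```c
#include <stdint.h>

#define NFRAMES 4096  /* scaled to physical frames, not virtual pages */

struct ipt_entry {
    uint32_t pid;    /* which process owns this frame */
    uint32_t vpn;    /* which of its virtual pages lives here */
    uint8_t  valid;
};

static struct ipt_entry ipt[NFRAMES];

static uint32_t ipt_hash(uint32_t pid, uint32_t vpn)
{
    return ((pid * 0x9E3779B9u) ^ (vpn * 0x85EBCA6Bu)) % NFRAMES;
}

/* Returns the physical frame number, or -1 for a miss (page fault). */
int ipt_lookup(uint32_t pid, uint32_t vpn)
{
    uint32_t h = ipt_hash(pid, vpn);
    for (uint32_t i = 0; i < NFRAMES; i++) {       /* linear probing */
        uint32_t slot = (h + i) % NFRAMES;
        if (!ipt[slot].valid)
            return -1;                             /* miss: fault */
        if (ipt[slot].pid == pid && ipt[slot].vpn == vpn)
            return (int)slot;  /* the slot index *is* the frame number */
    }
    return -1;
}
```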
The one example where that wouldn't be true is a virtually addressed cache, but we're gonna talk about physically addressed caches because those are pretty much what everybody has these days, okay? So what does the MMU do on a translation? Well, with a one-level page table it's gotta read the page table entry, check some valid bits and then go to memory. A two-level table has to read a couple of page table entries out of DRAM, check valid bits and so on, and a four-level page table is much more expensive still. So clearly we can't go to the page table all the time or we've got problems; we've just destroyed all of the cool cache locality we've been working with. So what do we do about this, okay? Well, where is the MMU? Typically we have a processor, then we have the MMU between the processor and the cache, and then we have a memory bus where the physical DRAM is. The processor's requests use virtual addresses, and they go through the MMU to the cache and the memory system. And so we wanna figure out how to make this thing fast. When we make a request, some time later we get data back, either from the cache or from physical memory, and we wanna have the principles of locality work well here, okay? So what is the MMU doing? Well, the simple answer is it's translating from virtual to physical. As long as it does that translation correctly, we don't care how it makes itself fast. There's nothing I've done so far in describing the translation as a tree of tables that requires us to go through the full tree of tables every time, as long as we can keep something fast that's consistent with the actual tree of tables, okay? So let's see if we can use caching to help, okay? So if you remember caching, okay? And this is a picture of me and my desk, which you guys can't see because of my background, right, on my Zoom. A cache, if you remember, is a repository for copies that can be accessed more quickly than the original. We're gonna try to make the frequent case fast and the infrequent case less dominant. Caching basically underlies everything in computers, and the operating system, I like to joke, is all about caching everything, okay? Everything's a little bit about protection and dual mode, but the rest of it's about caching. You can cache memory locations, you can cache address translations, you can cache pages, file blocks, file names, network routes, you name it, you can cache it, and the rest of the term is gonna be about how we can use caching in clever ways to make things faster. It's only good, though, if the frequent case is frequent and the infrequent case is not too expensive. So that means that when I put something on that desk so that I can look at it frequently, it had better be the case that I am frequently looking at things on my desk; otherwise I'm basically just wasting desk space. It also ought to be the case that when I can't find something on that desk, it doesn't take too long to find it, otherwise everything is slow no matter how good your caching is, okay? And an important measure, and this is just reminding you guys, is the average memory access time, AMAT, which is the hit rate times the hit time plus the miss rate times the miss time, where hit rate and miss rate together add to one. That should be familiar to you; this is a 61C idea, right?
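Here's that formula written out as a tiny C program, using the same numbers as the example that follows; the 1 ns hit time and 100 ns miss penalty are just the example's assumptions.

```c
#include <stdio.h>

/* AMAT = hit_rate * hit_time + miss_rate * miss_time, where the
 * miss time includes the hit time plus the miss penalty. */
static double amat(double hit_rate, double hit_ns, double penalty_ns) {
    double miss_rate = 1.0 - hit_rate;
    double miss_ns   = hit_ns + penalty_ns;   /* 1 + 100 = 101 ns here */
    return hit_rate * hit_ns + miss_rate * miss_ns;
}

int main(void) {
    printf("90%% hit rate: %.1f ns\n", amat(0.90, 1.0, 100.0));  /* 11.0 ns */
    printf("99%% hit rate: %.1f ns\n", amat(0.99, 1.0, 100.0));  /*  2.0 ns */
    return 0;
}
```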
So say the processor has to go to DRAM, and that's a hundred nanoseconds every time; or I can put in a cache that takes one nanosecond, and maybe I can make this a much faster operation on average if I can put the right stuff in the cache. So the average memory access time, which I just showed you, works like this: if the hit rate of the cache is 90%, then the average memory access time is 0.9 times one, which is the time to get the data out of the one-nanosecond cache, plus 0.1, that's the 10% where I miss, times 101. Now why 101? Well, because in a situation like this, I go all the way down to DRAM, I pull the data into the cache, and then I do that last access out of the cache. And so the result is on average 11 nanoseconds, as opposed to a hundred nanoseconds. So I've gained a lot by having a 90% hit rate, okay? If the hit rate is 99%, notice that my average comes down to about two nanoseconds. So the higher my hit rate, the better I do, okay? And the other way to say this is that the miss time includes the hit time plus the miss penalty: when I miss, I pay both the time to hit, which is the one nanosecond, plus the time to actually service the miss. That's why I ended up with 101 nanoseconds there, okay? Now, the other reason to deal with caching is basically this, right? Look at all of these lookups in various memories, looking things up, checking permissions, et cetera. We've just got to do this quickly somehow, okay? And the irony is that if we're using caching to make loads and stores fast, but to figure out what to load or store we have to go to DRAM, then that's ironic in a very big way. So we want to make the translation fast enough that we get back our advantage from caching, and that's where we need to go. What we're going to do is use a translation lookaside buffer, or TLB, to cache our translations and thereby make this fast. Okay, and why does caching work? Well, you know about locality; this is 61C. There's temporal locality, which is locality in time: if I access something now, I'm likely to access it again soon. And there's spatial locality: if I access something, I'm likely to soon access something close to it in memory. Temporal locality shows up because we have loops and all sorts of patterns where we tend to access things over and over again; spatial locality shows up because we build objects as structures, so when you access one thing in a structure, you tend to access the others soon, okay? And so we can look at caches as an address stream coming from the processor that works its way through an upper-level cache and then a second-level cache and so on down to memory, and we can start talking about the total performance we get by adding caching, okay? Now, if you remember the memory hierarchy, this is a good example: we have registers that are extremely fast; then level-one cache, which is quite fast; a level-two cache, which is bigger but slower; a level-three cache, which is maybe shared on a multi-core system and slower still; and then main memory is even slower, SSD is slower, disk is slower. But you notice that as we get slower, we also get a lot bigger. That relationship between speed and size is really physics, okay?
Something that can store a huge amount of data is gonna take longer to get at than something that can only store a limited amount of data, all right? And really, we want our address translation to sit up here, between the speed of the registers and the L1 cache, okay? But main memory, which is where our page table is stored, is way down here. So there's clearly a problem, right? We're talking about sub-nanosecond operations versus multiple nanoseconds, or hundreds of nanoseconds to get to DRAM, and we can't have every access go to memory or we've got a problem, okay? Now, if I distinguished between the time to access memory and the time to access DRAM just now, it's partly because everything down here is sometimes considered storage. I don't wanna confuse you much, though: if I say memory and don't make any distinction, I'm talking about DRAM. I didn't mean to make that distinction for you; sorry about the confusion. So we wanna cache the results of recent translations. What that means is: let's make a table that goes from virtual page number to physical frame, and we'll just keep a few of those around so that we can be very fast. This quick lookup table needs to be consistent with the page tables, but it needs to be small enough that it's really fast, so that it can work between the processor and the cache, okay? And that's the TLB. It's really recording recent virtual page number to physical page number (or physical page frame, same thing) translations. If a lookup is present, then you have the physical address without reading any of the page tables, and you're quick, okay? This was actually invented by Sir Maurice Wilkes, who is one of the famous luminaries of computer architecture. He developed this thing before caches were developed, and when you come up with a new concept, you get to name it. So if you're wondering why it's called a translation lookaside buffer, it's because he decided to call it that; you get to name it anything you want. And people eventually realized that if it's good for page tables, why not for the rest of data and memory, and that's where caches came from. The question in the chat about whether the TLB is stored on the processor is an interesting one. Today, absolutely; this is part of the core. The processor, the MMU and the first-level cache are all tightly bound, on the same little chunk of the chip even, and I'll show you a picture of that, though we may not get to it today. Originally, back in the 80s and early 90s, the MMU was actually a separate chip, and it's been getting closer and closer to the processor at the same time the caches have been getting closer and closer to the processor. Also, when a TLB miss happens, the page tables themselves may be cached, so you don't necessarily go all the way to memory. So here's another look at this. The CPU generates a virtual address and hands it to the TLB. The TLB asks: is this translation cached? If the answer is yes, we immediately have a physical address and we go to physical memory. Now here, for Sarah, who asked this question earlier, this physical memory could be the cache backed by DRAM. I'm deliberately not saying DRAM here; whatever we want here that's fast is our cache plus DRAM, okay? And so if the translation is in the TLB and the TLB is fast enough, then we can take a virtual address, go through the TLB quickly, look it up in our cache, and now we've scored, right? Because that's fast.
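Here's a minimal sketch of that whole flow: the hit path just described, plus the miss path the next paragraph covers. The fully associative array with round-robin replacement is an assumption for illustration (real TLBs check all entries in parallel in hardware), and walk_page_table() stands in for the walk sketched earlier.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define TLB_ENTRIES 64     /* kept tiny so the lookup can be extremely fast */

typedef struct { uint64_t vpn, pfn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];
static unsigned next_victim;                 /* trivial replacement policy */

/* Assumed fallback: the full page-table walk. */
extern bool walk_page_table(uint64_t vpn, uint64_t *pfn);

bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)    /* hardware does this in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
            return true;                     /* hit: no page tables touched */
        }
    uint64_t pfn;
    if (!walk_page_table(vpn, &pfn))         /* miss: walk, then refill */
        return false;                        /* page fault */
    tlb[next_victim] = (tlb_entry_t){ .vpn = vpn, .pfn = pfn, .valid = true };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    *paddr = (pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
    return true;
}
```

One way to see why misses matter: with these made-up numbers, 64 entries of 4 KiB pages only "reach" 256 KiB of memory, so any working set bigger than that will miss, and every miss costs a full walk.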
If, on the other hand, it's not in the TLB, then we have to go to the MMU, actually walk the page table, store the resulting translation in the TLB, and then we can go to the cache. And obviously, if we're in the kernel doing untranslated accesses, we can go around the TLB. The question is really: is this caching gonna work? Is there locality in our page translations? And the answer, if you think about it: certainly instructions have a huge amount of locality, right? Because you do loops and the code is executed together, you've got spatial locality. So for instructions, this sounds clear. The stack has a lot of spatial locality, so that sounds good too. And data accesses don't have as much locality, but they have enough to make the TLB work pretty well, okay? Because of what I mentioned earlier, objects tend to be laid out together in memory, and that's gonna lead to locality in the TLB. And I'm gonna remind you guys, I don't think we're gonna get to it this time, but I'll remind you next time about all the stuff you learned about caches. Caches can be multiple levels: there can be a first-level cache, a second-level cache, and you can do the same thing with TLBs. There's nothing that says the TLB, which is a cache, can't have first, second, third levels, and modern processors do have multiple levels of TLB caching. All right, so what kind of a cache is the TLB? Well, we can start talking about things like how many sets it has, its line size (here, what's stored is page table entries), what the associativity of this cache is, et cetera, all right? And this is where I'm gonna remind you a little bit of some of the caching things that you remember. The first question might be: how might the organization of a TLB differ from that of a conventional instruction or data cache? To get at that, we're gonna start by remembering what causes cache misses, okay? Next time we'll talk more about cache structures, but there are the so-called three Cs, which are actually from Berkeley: Mark Hill, who's been a professor at the University of Wisconsin for a long time, came up with the three Cs when he was a graduate student at Berkeley. Those are compulsory misses, capacity misses and conflict misses. Compulsory misses happen the first time you access something that's never been accessed before; there's no way the cache could have it, because it's never seen it. That's a compulsory miss, or a cold miss, okay? You pretty much can't do anything about a compulsory miss; the best you can do is pull the data in from memory, or prefetch it, which would be one way to deal with compulsory misses. Capacity misses are cases where you pull something into the cache but the cache is just too small, and therefore the next time you go looking for it, it's not there. Conflict misses are cases where your associativity is smaller than fully associative, and two entries map on top of each other in the cache: you pull the first one in, you access the second one, which kicks the first one out, and when you go looking for the first one again, you now have a conflict miss. So, for compulsory misses, the best you can do is figure out some sort of prefetching. For capacity misses, you've gotta make your cache bigger.
For conflict misses, either making the cache bigger or increasing the associativity is what matters, and we'll explore that next time, okay? I like to say there's a fourth C, three Cs plus one. The fourth C is a coherence miss, which we will talk about a bit as well. That's an invalidation miss, where you have multiple processors: core A reads some data, core B writes that data, which invalidates the copy core A had, and when core A goes to look at it again, it's a miss, a coherence miss, okay? So I'm gonna leave it at that, since we're running out of time. But in conclusion, we've been talking a lot about page table structures, which is really: what does the MMU do, and how does it structure the mapping between virtual and physical addresses in memory? We talked about this notion that dividing memory into fixed-size chunks, pages, is very helpful, and that the virtual page number from the virtual address gets mapped through the page table to a physical page number, okay? We talked about multi-level page tables, where a virtual address maps through a series of tables, as a way of dealing with sparseness. And then we talked about the inverted page table as basically a hash table whose size is more closely related to the size of physical memory than to the size of the virtual address space, okay? We've also been talking about the principle of locality, reminding you about temporal locality and spatial locality. We talked briefly about the three major categories of cache misses, compulsory, conflict and capacity, plus coherence for the plus one. As you can imagine, a miss in the TLB can be very expensive, because we potentially have to do many DRAM accesses to handle it. So we have to be very careful to have as few misses as possible, and that's gonna lead us to higher associativity, or even a fully associative cache, okay? When we talk next time about cache organizations, direct-mapped, set-associative, fully associative, we're gonna focus on the highly associative ones, okay? And we've also talked about the TLB this time, where a small number of page table entries are cached right on the processor, so they're extraordinarily fast, at the speed of registers. On a hit, you get the full advantage of caching for regular loads and stores: you translate quickly and then go to the actual data cache or instruction cache. On a miss, you've gotta go traverse the page table. All right, I think we're good there, so I'm gonna let you guys go. I hope you have a great weekend, and next Monday we'll pick up with our brief trip down memory lane through caches, and then we're gonna start talking about page faults and the interesting things we can do with them. So I hope you have a great day, a great evening and a great weekend. We'll see you next week.