Hello everybody, welcome back to CS162. We're gonna pick up where we left off, which is talking about caching, just to remind you a little bit from 61C. Before we get there, though, we've been talking a lot about virtual memory, and one of the things I wanted to show you is what I like to call the magic two-level page table, which works when you have a 32-bit address space and four-byte PTEs, page table entries. Then you can do this 10-10-12 pattern. You have the page table pointer pointing at the root page, and the first 10 bits select one of 1,024 entries, which points to the second-level page table. The next 10 bits select an entry there, one of 1,024 entries, and then finally we point at the actual page. And of course on a context switch you have to save this page table pointer and that's it, because the rest of this is in memory, and there are valid bits on the page table entries, so you can in fact page out parts of the page table if you don't need them: you can just set the invalid bit in a top-level page table entry and then page out the second level. All right, were there any questions on this? So this works particularly well when you have a 12-bit offset, which is 4,096-byte pages, and four-byte page table entries. Okay, good.

Now, by the way, there was a little question on Piazza about paging out the page table, or putting it in virtual memory. I'll say a little bit about that later, but I do wanna point out that the nice thing about this particular layout is that it's all done in physical space: the addresses in the page table are physical addresses, and yet you can still page out part of the page table, so it's pretty close to being in virtual memory except that the actual addresses are physical ones.

The other thing we were talking about was the translation lookaside buffer, the TLB, and we've talked about it as looking like a cache. So when a virtual address comes out of the CPU, we quickly look in the TLB to see whether that virtual address has been cached, and if the answer is yes, then we can go right on to physical memory. By the way, this physical memory here can be a combination of cache and DRAM or whatever, so this can be very fast, at the speed of the cache for instance. On the other hand, if the TLB doesn't have our virtual address in it, then we have to translate through the MMU, which usually involves walking the page table. Once we get the result back, we put it in the TLB and continue with our actual access, and subsequent ones, assuming we haven't kicked that address out of the TLB, will be fast. And of course the CPU, when you're in kernel mode, can basically go around the TLB to look at things in the physical pages, okay? And of course the question is, does there exist any page locality that could make this work? Because the only reason the TLB would work as a cache on addresses is if there was actual locality. And basically what I said was, well, instructions clearly have page locality, stack accesses have locality, and so really it's a question of the data accesses, which also have locality. In many cases they don't have as good locality as the instructions and stack, but it's pretty good, okay? And can we have a hierarchy of TLBs? Yes, so if you remember from 61C you could have a two-level cache, a first and a second level, so you can also have a two-level TLB, okay? And I'll say a little bit more about that later.
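Just to make the 10-10-12 split concrete, here's a minimal C sketch of how a software page-table walker might decompose a 32-bit virtual address and follow the two levels. The names and layout (pte_t, PTE_VALID, phys_to_virt) are illustrative assumptions, not the exact hardware format.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_VALID 0x1u                     /* assumed valid/present bit */

typedef uint32_t pte_t;                    /* four-byte page table entry */

extern void *phys_to_virt(uint32_t paddr); /* assumed kernel helper for reading physical memory */

/* Decompose a 32-bit virtual address with the 10-10-12 pattern. */
static inline uint32_t l1_index(uint32_t va) { return (va >> 22) & 0x3FF; } /* top 10 bits  */
static inline uint32_t l2_index(uint32_t va) { return (va >> 12) & 0x3FF; } /* next 10 bits */
static inline uint32_t page_off(uint32_t va) { return va & 0xFFF; }         /* low 12 bits  */

/* Walk the two levels. 'root' is the physical address of the top-level page
 * (what the page table pointer holds); returns false on any invalid entry. */
bool translate(uint32_t root, uint32_t va, uint32_t *pa_out) {
    pte_t *level1 = phys_to_virt(root);            /* 1,024 four-byte entries */
    pte_t pde = level1[l1_index(va)];
    if (!(pde & PTE_VALID)) return false;          /* second level may be paged out */

    pte_t *level2 = phys_to_virt(pde & ~0xFFFu);   /* 1,024 four-byte entries */
    pte_t pte = level2[l2_index(va)];
    if (!(pte & PTE_VALID)) return false;          /* page fault territory */

    *pa_out = (pte & ~0xFFFu) | page_off(va);      /* 20-bit frame + 12-bit offset */
    return true;
}
```

Note how the two 10-bit fields each select one of 1,024 entries and the 12-bit offset is just copied through; this is exactly what the hardware walker does when the TLB misses.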
So what we're gonna do now in the early part of the lecture is remind you a bit more about caching, then we'll talk about TLBs, and then eventually, maybe about halfway through the lecture, we'll change to how we actually use the TLBs and page table entries to get demand paging and a few other really interesting aspects of the memory system.

At the very end of the lecture last time we were starting to remind you of the sources of cache misses, and I like to call these the three Cs plus one. Mark Hill, when he was a graduate student here, came up with these three Cs representing the sources of cache misses; he went on to be a very well-known computer architecture faculty member at the University of Wisconsin, Madison. The three Cs he came up with in his PhD thesis were compulsory, capacity, and conflict. A compulsory miss is one that represents an address that's never been seen before, so there's no way the cache could have had it, because it's never been accessed. The only way of getting around a compulsory miss, or a cold miss as it's sometimes called, is if you have some sort of prefetching mechanism that can predict the access in advance. The other two, capacity and conflict, are more interesting. A capacity miss is a miss that occurs because the cache is just not big enough: you put your data into the cache, the cache gives you very fast access for a little while, and then eventually you try to put too many other things in there and you kick something out, so when you miss on it again, that's a capacity miss, because the cache was too small. That's a little different from a conflict miss. A conflict miss is a situation where it's not that the cache was too small, but that the places, the slots in the cache where you're allowed to put something, were too few. So you put something in the cache and you were happily using it, but then you put in another couple of things that mapped to the same slot, and I'll show you that in a moment, and it got kicked out; when you go back and miss again, that's a conflict miss, okay? And of course the plus-one portion of this is the coherence miss. This was not talked about in Mark Hill's thesis, but I like to put it down just to remind you that this is another source of misses: if you have a multicore, let's say with two processors, two cores, one of them loads something into its cache and is reading it, and the other one goes to write that item. The only way to keep the caches actually coherent is for an invalidation to kick it out of the first cache so that it can be written cleanly in the second one, and at that point, when the first processor looks again, it's a miss, and that's a coherence miss, all right? So do we have any questions on this? Okay, these are all reminding everybody of 61C, I hope.

So now, when we're using a cache, this generic way of looking at addresses is the way we often like to think of it. The full width from left to right is the number of bits of an address; for instance, in a 32-bit processor this might be 32 bits. The bottom portion, let's say five bits, represents an offset inside a cache block, okay? Once you store things in the cache at a minimum size, maybe 32 bytes, or in a modern processor 128 or 256 bytes, all of those bytes are pulled in at once or kicked out at once.
And so the block offset really just says, well, once I've found a block in the cache, where do I access within it? The next two fields, the index and the tag, are somewhat more interesting. If you imagine the bottom part is within a block, then the top two pieces are about finding the block for you. The index selects the set, and then the tag says, well, within this set of possible cache entries, let's see which one might match the one I'm looking for, okay? So the block is the minimum quantum of caching; it's the minimum thing that goes in and out. Think of that as, for instance, 32 bytes or 128 bytes. Many transfers don't really use the byte select field, because the whole quantum goes back and forth, but if you look at a processor accessing a single byte, then you need the offset. The index is used to look things up in the cache and identifies what we'll call the set, and then the tag is used to verify that what's there is actually the copy you're looking for.

So let me give you this in figures, because that's always easier. Let's look at the first one, called a direct-mapped cache. A direct-mapped cache is a 2^N-byte cache where the uppermost 32 minus N bits are the cache tag, and the lowest bits are divided between the cache index and the byte select. So let's take a look at this. Here's an example where we have a one-kilobyte direct-mapped cache with 32-byte blocks, okay? If the cache is one kilobyte and the blocks have 32 bytes in them, we know that there are 32 cache lines. How many bits do we need to represent 32 entries? Quick, do your log base two, everybody — five, very good. So here's the layout of a direct-mapped cache. My cache data has room for 32 bytes; it has room for a tag; and it has room for a valid bit. And there are some number of these lines, okay? What we do is take this address, which has five bits of byte select, as was mentioned by folks in the chat, okay? The cache index is gonna be used to look up in the cache, and then we're gonna match the tag. So the first thing we do is take the index — in this case there are five bits of index — and that selects one of 32 cache lines. Once we've picked a target cache line, we check: does the tag match or not? If the tag matches, then we know we're good to go. At that point we can say, well, this cache line is valid, so I'm now gonna use the byte select to pick which of the 32 bytes, and I'm good to go, okay? And notice that there's only one slot here in the cache for something that has this index, okay? That's why it's called a direct-mapped cache: a given index gives you only one possible cache line, and then a single tag is matched against that cache line or cache block, okay? Does that remind everybody of how this works? Okay, so now let's go a little further on this unless there are questions. This is pretty straightforward. And again, the reason it's direct mapped is there's only one full cache line that comes out of the cache index.

Oh, and the other thing I wanted to say before we get to the set-associative cache: notice that we have five bits in the byte select and five bits in the cache index, so that's a total of 10 bits, and two to the 10th is 1,024, which you all have memorized quite well now, and that's a one-kilobyte cache. So if you add all this up, our cache totals 1,024 bytes, and that's because there are 10 bits selecting into it, okay?
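To pin that down, here's a small C sketch of the lookup for this 1 KB, 32-byte-block direct-mapped cache. The struct and names are illustrative, not a real hardware description; the point is how the address splits into byte select, index, and tag.

```c
#include <stdint.h>
#include <stdbool.h>

/* 1 KB direct-mapped cache with 32-byte blocks: 32 lines,
 * 5-bit byte select, 5-bit index, 22-bit tag. */
#define NUM_LINES   32
#define BLOCK_SIZE  32

struct cache_line {
    bool     valid;
    uint32_t tag;                 /* upper 22 bits of the address */
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Return true on a hit and copy out the requested byte. */
bool cache_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t byte_sel = addr & 0x1F;          /* bits 4..0   */
    uint32_t index    = (addr >> 5) & 0x1F;   /* bits 9..5   */
    uint32_t tag      = addr >> 10;           /* bits 31..10 */

    struct cache_line *line = &cache[index];  /* only one place to look */
    if (line->valid && line->tag == tag) {
        *byte_out = line->data[byte_sel];
        return true;                          /* hit */
    }
    return false;                             /* miss: go to memory */
}
```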
Now let's go to a set-associative cache, where in general we could have N ways; I'm gonna show you two. I'm gonna keep the same total cache size, but do it as a two-way set-associative cache. We're always gonna have five bits down below in this set of examples, but now instead of a five-bit index we're gonna have four bits, and the reason for that is we have two separate banks of cache. That index actually selects two different cache blocks, one from the left bank and one from the right bank, and once we've got them, we've gotta compare the tag against two different tags, and that'll say which of these two lines in the cache, if either, represents what I'm looking for. Okay, so this index is now selecting two things. I check the tag on both sides with a little comparator, and, assuming things are valid — that's why I'm checking the valid bit — if I match, I get a one out of the comparator that selects for the mux. In this example the left one matched the tag, the right one didn't, and I get data out, okay? So that's a two-way set-associative cache: two-way set associative because each set gives me two possible places, so it's not direct mapped, okay? Questions? Good? Reminding everybody of 61C, I hope? Oh, yes, we have questions — go ahead and type your questions into the chat, please.

Okay, so why do we call this the cache tag? The cache tag is basically all the other bits, okay? It's everything that is not the byte select, or offset, and the cache index; the rest of the address is the tag. You need to check that the tag matches, because that's how you know this block is the one you're looking for, as opposed to representing some other part of memory, okay? So this tag is gonna be big — it's basically everything else — and the tag is not in the data; it's separate from the data. You could think of it as metadata on the cache, okay? We good on that, everybody?

So now, we could do this arbitrarily, where we keep shrinking the index and have more and more banks until we have zero bits of index, okay? And that's called fully associative; it looks kind of like this. So here we have 32 places in the cache, 32 blocks just like we had before, but they all have a tag, and they're all checked in parallel. Notice we take the tag, which is now 27 bits because we totally eliminated the index, and we compare it with all of the tags, and we pick the one that matches; that's the one that's gonna select for us which cache line is valid, okay? So think of this as the extreme case of a set-associative cache, where we've completely gotten rid of the index and we have one complete set, which is basically the whole cache, okay?

Now, can anybody tell me which one of these associativities — direct mapped, two-way set associative, or fully associative — is faster, and why? Okay, I'm seeing a bunch of people saying the direct-mapped cache is faster, but look, the fully associative one is checking all of these cache tags in parallel — why is that not faster? By the way, you're right, direct mapped is faster; does anybody know why? Okay, so now I'm seeing some folks kind of on the right point here — there you go, the last person got it — it takes a long time for signals to propagate. You gotta think like hardware, not like software. First of all, what you see on the screen — the fact that we're checking all the tags — doesn't take extra time, because it's all happening in parallel, okay?
So, you know, the cool thing about hardware — and I'm a hardware architect, so I think it's cool — is that we don't have to do this serially, one at a time; we're doing all of this in parallel. So you might think, off the bat, that the fully associative cache was faster, but in fact, what I didn't show you on the screen is that once I've done a match, I have to take this data and select from it based on the matching of the tags. In the fully associative case I'm selecting one cache line out of 32, which is slower because it's multiple levels of logic, compared with the two-way case where I'm selecting one of two, which is in turn slower than the direct-mapped case where I don't have to do any selecting at all. So direct mapped is actually faster in hardware, okay? The other thing I will point out is that the fully associative cache, because it's so much bigger — we're checking all of those tags in parallel — actually takes up more space on the chip, and as a result there are speed-of-light issues: it takes a little longer for the signals to get around, so the fully associative cache is slower for that reason as well, because it's bigger, okay? So propagation speed and the size of things on the chip actually matter when you're thinking about hardware.

So the thing to keep in mind is that direct mapped is faster, but notice there's something interesting about the direct-mapped cache. There's a whole bunch of possible addresses that all map on top of the same line here. In fact, anything that has the same index — well, how big is that set? We took out 10 bits, so there are 22 bits of tag left, which means millions of block addresses all map to the same place in the cache. And if I access more than one of them, that's a conflict miss. So when we were looking earlier at conflicts: I get a lot more conflicts with the direct-mapped cache, because there's only one place I can put those millions of cache lines that share the same index. Whereas in the two-way cache there are two places I can place each one, so there are fewer conflicts. And finally, in the fully associative cache I can put it anywhere, so there are basically no conflict misses, okay? Good. So hopefully that's helpful. So now, even though fully associative is slower, it'll have fewer conflicts. And when you start thinking about TLBs, we may wanna make that trade-off: if the result of a conflict is a miss that's really expensive, I might have more of a tendency to go fully associative to avoid misses, even though it's a little slower. And you might start thinking a little bit about whether a TLB would make sense to be fully associative, because the result of missing is gonna be walking the page table, which could be very slow.

So where does a block get placed in the cache? This is gonna show you those three options pretty clearly. Suppose we have a 32-block address space, and this particular block is one that's gonna get mapped into the cache. If the cache is direct mapped, then block 12 can go in only one place. This is the address space; here's the cache; I've got eight entries. There's only one place for block 12 to go: 12 mod eight, exactly there. If I have a two-way set-associative cache, there are four sets, so there are two places block 12 can go, and in a fully associative cache there are eight places block 12 can go.
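Here's a tiny sketch of that placement arithmetic for the block-12 example, assuming an 8-line cache; the helper name is made up, but the formula (set = block number mod number of sets) is the standard one the figure illustrates.

```c
#include <stdio.h>

/* For a cache with 'num_lines' lines and associativity 'ways', a memory
 * block can live in any of the 'ways' lines of exactly one set, chosen
 * by block_number mod num_sets. */
int set_for_block(int block_number, int num_lines, int ways) {
    int num_sets = num_lines / ways;
    return block_number % num_sets;
}

int main(void) {
    int block = 12, lines = 8;
    printf("direct mapped:      set %d (1 line per set)\n",  set_for_block(block, lines, 1)); /* 12 mod 8 = 4 */
    printf("2-way set assoc.:   set %d (2 lines per set)\n", set_for_block(block, lines, 2)); /* 12 mod 4 = 0 */
    printf("fully associative:  set %d (all 8 lines)\n",     set_for_block(block, lines, 8)); /* 12 mod 1 = 0 */
    return 0;
}
```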
Okay, so this is another way to look at it: if I have the same physical cache size, the associativity — direct mapped, two-way set associative, fully associative — says something about how much flexibility I have to put an item from the memory space up top into the cache. Okay, good. Hopefully this is all familiar to everybody.

So what do you replace on a miss? If you're gonna load something new into the cache, you've gotta kick something else out, and with direct mapped there's only one place you can load it. So in this case, if I now go to something else, like address 20, and I try to load it into the direct-mapped cache, there's only one choice of which one to kick out; it would be this one. So that's easy. But if I go to two-way set associative, now I've gotta pick one of two, and in fully associative I have to pick one of the eight. Okay, and that's called the replacement policy. With direct mapped there's only one choice; for set associative or fully associative there are lots of options: random, least recently used, et cetera. What you can see is that for a cache, oftentimes the difference between random and LRU is very little, especially when you get a larger cache. So random works pretty well for caches when you have higher associativity and bigger caches, and the cost of keeping track of what the LRU entry is, is often not worth it in a cache. That's not gonna be the case when we get to paging in a moment.

The other question is what you do on a write. We have two options: one is write through and one is write back. In the case of write through, when the CPU writes data, I write it into the cache and into the DRAM — it writes through the cache to the DRAM, okay? That's good because it makes sure the DRAM always has the most up-to-date data; it's bad because it's slower: the write goes as slow as the DRAM rather than as fast as the cache. In the case of write back, I just put the data in the cache and keep track of the fact that it's dirty, that it's been written. That's very fast, but now I have to make sure that when I kick it out of the cache, because I'm replacing it with something else, I write it back to DRAM, or I'm gonna lose my data, okay? So the pros of write through are that read misses never cause writes back to memory, because memory is always up to date, so you never have to save the victim block; the con is that processor writes are slower. For write back, the pros are that repeated writes are not sent to DRAM and writes are fast; the con is that it's a little more complex to handle. And the two are used in different places: oftentimes, for instance, write through might show up at the first-level cache, but write back at the second- and third-level caches, because the cost of going all the way through to DRAM can be very expensive. Okay, good.

Now, there was an interesting issue that came up on Piazza after last lecture, so I made a new slide to represent it — at the risk of confusing people, which I don't wanna do — but there is a difference between what's called a physically indexed cache and a virtually indexed cache. Here's what we've been talking about: this is a physically indexed cache. So what does that mean?
That means what comes out of the CPU is a virtual address; it goes to the TLB, and assuming the TLB matches, we combine the offset with the physical page frame and go directly to the cache, and we look up in the cache with a physical index. So what that means is that the addresses we hand to the cache are physical. If the TLB misses, we go to the page table, just like the multi-level page table I showed you at the beginning of the lecture, and we walk it, but all of the addresses in the page table, including the top-level pointer in CR3, are physical addresses. So when we walk the page table, those accesses can simply go through the cache as well, because everything is physical, and the mechanism of caching DRAM in the cache is all handled by hardware, so nothing else has to worry about it. This is a very simple organization, and by the way, it's the one the x86 uses, and it's the one we talk about in class the most. The other big advantage, and you'll see what I mean in a moment, is that every piece of physical data — meaning something that has a location in memory — has only one location in the cache. That's the physically indexed case, because the cache really is just a portal onto the memory; there's nothing special there. So when we context switch, we don't have to do anything special with the cache. We might have to do something with the TLB, which we'll talk about in a moment, but we don't have to mess with the cache, okay? The challenge, as you can see here, is that assuming we've made our cache really fast — maybe we have three levels of cache, carefully tuned into the processor pipeline and all that sort of stuff to make it fast — the problem is that we have to take the CPU's virtual address and pipe it through a TLB before we can even look at the cache, and so this TLB needs to be really fast.

The other option, which came up on Piazza, is this idea of a virtually indexed cache. This is more common in higher-end servers that aren't made out of x86 processors; I would say it's less common these days. You take the CPU, and the first thing you do is just look up in the cache, and that lookup, since the cache is indexed with virtual addresses, is just fast: you put the virtual address in and you either get a hit or you don't. If you miss, then you've gotta do something. When you miss, you look up in the TLB — and notice that I intentionally drew the cache and the TLB kind of on top of each other, because you can actually be looking up in the TLB at the same time you're checking the cache, so you can overlap that. Then we look up in the TLB, and assuming the TLB hits, we can just go to memory, find the physical data, and put it back in the cache — these arrows are all for addresses, which is why there isn't a data arrow coming back — and then we're good to go. If we miss in the TLB, then we've gotta do something, and if we want the same advantage of caching the page tables, then we can have the page tables made of virtual addresses, and we do a recursive walk of the page table by looking up virtual addresses, which may in turn cause TLB misses, which cause page table walks. The key there is that you've gotta be careful so that ultimately the root page-table entries are pinned in the TLB in a way that this process terminates, okay? And so that's a little more complexity there, okay?
So the challenges of the virtually indexed cache: well, first of all, the benefit is that it can be blazingly fast, because there's no TLB lookup between the CPU and the cache. The challenge, though, is that if you think this through a little bit, you'll see that the same data in memory can be in multiple places in the cache, because remember, every process has its own notion of zero — I've said that several times this term — and therefore if two processes are mapping the same memory, it'll show up at different virtual addresses, and that can get messy, okay? And in fact, what you need to do with this cache layout is, when you switch from one process to another, you actually have to flush the cache, okay? The case where you might actually have the same data in multiple parts of the cache at once is when you instead tag the cache with process IDs — we won't go into that here — but then you've got aliases, where the same data can simultaneously be in different parts of the cache, and that can lead to all sorts of consistency problems, all right? So I just wanted to make sure everybody saw these two. If that was too much information, we're gonna stick with the top one; who knows, I might actually ask you a question about the lower one, but for the rest of the lectures we're gonna be mostly talking about this physically indexed one up here. It's popular because of its simplicity, but we need to figure out how to make the TLB fast, okay?

So why is the page table virtually addressed here? Well, just because if we want to cache the page table — which we do, because remember we're walking through a bunch of memory and we don't wanna go as slow as DRAM — we wanna be in the cache, and to be in the cache in this organization, the page table has to have virtual addresses in it rather than physical addresses like the ones we've been talking about, okay? Because that's the only way we can come back through the main cache. Now, we could pull some other tricks, where we have a separate cache just for the page table, or a separate pipeline that tries to make accessing DRAM faster than otherwise, but the simplest idea here is to make the page table virtually addressed, okay? It's actually not so crazy, and it's definitely done by some server machines. The trickiest part about this is not what we just said; the trickiest part is the aliasing in the cache, and that gets messy quickly, okay? The top cache, the physically indexed one, doesn't have to be invalidated on a process switch, because that cache is purely a portal onto the underlying memory: one location in the cache, one location in memory, never multiple cache locations for the same memory, all right? That's another reason it's a simpler organization. We do need to do something about the TLB, though; we'll come back to that in a moment.

On to administrivia: as we discussed in the chat before lecture, it does seem like midterms keep coming, and I'm sure you're getting them in all your other classes too, but yep, we're coming up on midterm two on Thursday, 10/29. The topics are basically everything up to lecture 17; I've listed a bunch of them here, but basically it's just everything up to lecture 17. There has been a discussion on Piazza about whether these exams are cumulative or not.
The answer is that we focus on the new material, but don't forget everything you learned in the first part of the class, because you never know — we might ask you something that requires you to remember how to synchronize or something like that. So don't forget everything you've learned, but most of the material will be focused on these new lectures. The other thing is we're gonna require your Zoom proctoring to be set up, and I think what we're gonna be doing is generating your Zoom rooms for you, but make sure you've got your camera and your microphone and your audio all set up in advance, because we're actually gonna be requiring that during the exam, all right? There's a review session on Tuesday the 27th, and there is a Zoom room just like last time; Neil will put it out — he's got all the information for it. I don't know that we have it yet, but we will soon. I guess it was a silent announcement a while back, but I do actually have office hours these days from two to three, Monday to Wednesday. There's a Zoom link that's posted both on the course schedule and, I think, in a pinned Piazza post, but definitely feel free to come by and chat about computer architecture, or life, the universe and everything, or quantum computing, or whatever you like. Probably don't wanna come with questions about detailed code aspects of your projects — you should stick with your TA office hours for that, because this is more for general discussions and interesting questions — but if you do wanna come with whatever questions you have, or if you wanna set up something private with me, we can do that, all right? And then I saved the best for last: the election's coming up, so please don't forget to vote if you have the ability to vote. This is one of the most important rights you have as a US citizen, so take advantage of it. However you vote is fine, but today is actually the last day to register if you haven't done that, so you do need to do that — thanks for pointing that out — and be safe if you go in person, or fill out your forms and mail them; just don't put them in a fake ballot box. Make sure that somebody like the post office actually gets it. And what's cool in California is you can sign up and you'll get texts as your vote works its way through the system, which is also pretty cool: the post office says we've received it, then the state says we've got it, it's ready to be counted, and so on, so you get to find out about all of that, okay? Alrighty, please vote.

So let's go back to some questions now, and I'm gonna be talking about physically indexed caches again. Here's our schematic of what that looks like: we've got the CPU going to the TLB, going to the cache, going to memory, and the question is what TLB organization makes sense here. Clearly the TLB needs to be really fast, because it's in the critical path between the CPU and the first-level cache, okay? So this needs to be really fast. That seems to argue for direct mapped, or really low associativity, in order to make it fast. But you also want very few conflicts, because every time you miss in the TLB you have to walk the page table, which, even if the page table is cached, could be four or eight memory references just to do a single reference.
So you don't wanna miss in the TLB if you can help it, which means you want few conflicts, and that pushes your associativity up. So there's a trade-off: the cost of a conflict is a high miss time, but the hit time gets slow if the TLB is too associative, and so there are a lot of tricks that are played to make the TLB fast. This is kind of a CS152 topic, but I thought I would say a little bit about it. The other thing is we've gotta be careful about what we use as an index into the TLB. If we use the lower bits of the page number as an index into a low-associativity TLB, then you can end up with thrashing, and that can be a problem; if you use the higher-order bits, you'll end up with a situation where big parts of the TLB are never used. So you've gotta be a little bit careful about this. The TLB is mostly a fully associative cache, although these days — I'll show you in a second — the x86, for instance, has set-associative first-level TLBs backed by a large second-level TLB.

So how big does it actually have to be? It's usually pretty small, basically for performance reasons; 128 to 512 entries is pretty standard. One of the reasons that there are more entries these days than there were, say, 10 or 15 years ago is that people use a lot of address spaces, including with microkernels, which tend to have a lot of address spaces, and so there can be a lot of TLB entries you might need. The other issue is that as your DRAM and your overall memory get really large, there are gonna be a lot of pages involved, so you need more TLB entries. That combination of a lot of different address spaces and a lot of memory pushes the TLB size up, and that's why modern systems tend to have a two-level TLB: a slightly smaller one at the top level and a much bigger second-level one, to try to make things as fast as possible. Small TLBs are often organized as a fully associative cache; if fully associative is too slow, you drop to something like two-way set associative, or you use what's sometimes called a TLB slice, where you have a little tiny direct-mapped TLB at the top level and a much bigger second level behind it.

Here's an example of what might be in the TLB. It's a fully associative lookup, for instance, in the MIPS R3000 — a very old processor, but it's easy to look at. Each entry has a virtual address, a physical address, and then some control bits, and the trick is that the virtual address comes in, you look it up fully associatively to find a matching tag, and the rest of the entry gives you the translation. The thing I wanted to show you about the R3000 is that it handled the question of how to build a fast TLB in an interesting way that was kind of possible back in the old days. You need a TLB both for the instructions and for the data, and what they did was arrange for the TLB lookup to take half a cycle. So although in 61C you learned about a five-stage pipeline with fetch, decode, execute, memory, and write back, here are the actual cycles up top. If you notice what really happens, the first half of the instruction-fetch cycle is actually a TLB lookup, and then there's a whole cycle overlapping the last half of instruction fetch and the first half of decode for the I-cache lookup. In the case of the data TLB, the address is computed in the first half of the execute cycle, and then the TLB lookup happens in the second half.
And so they were able to confine the TLB to half a cycle and deal with its speed by rearranging things in the pipeline a little bit. In general, that's much harder to do these days, and there are many more pipeline stages, as I'm sure you learned. So the thing to ask yourself is: if we're gonna go with a physically indexed cache, and we're not able to split cycles that way, then what are we gonna do? Really, as we've described this, we're taking the offset and copying it down, of course, but then we're taking the virtual page number, looking it up in the TLB, and copying the physical page number into the final address. The question is how we make this faster in general. And the answer is, well, one trick: take a look at this. I'm showing you the virtual address and the physical address, okay? The physical address is split up into a combination of the tag, the index, and the byte select — remember, we just showed you that — and the virtual address is the virtual page number and the offset. If you can arrange that the offset overlaps the index and the byte select of the cache, then you're golden, because the offset doesn't get translated by the TLB, so you can pipe the offset straight into the cache and immediately start looking up the index while you're looking up in the TLB. You're overlapping the cache access and the TLB lookup, even though logically you have to do the TLB lookup before you do the cache access, okay? So here's a picture of that trick: this is with four-byte cache lines, a 4K cache, and 4K pages. What you see is we take the page number and start looking it up in the TLB, and meanwhile we use the 10-bit index to look up in the cache, thereby giving us a four-byte cache line, okay? Then we can take the tag that comes out of the cache and the tag that comes out of the TLB — which is now the physical page number — compare the two, and do that check after we've done both the cache access and the TLB lookup. So that's how that parallel thing works. Isn't that cool? Galaxy-brained, as it says in the chat, yes. Now, if the cache is 8K in this organization, it's a little tricky, but there are all sorts of tricks people play; for instance, they might divide the 8K cache into two banks, so if you raise the associativity to two-way set associative, this still works, right? There are other things you can do — you can pull tricks where you do part of the cache access and finish the rest of it later — there's all sorts of stuff, okay? Now, the other option of course is, if you really want a really big cache and you're running into this problem, you might actually go back to your virtually indexed cache; though in fact Intel has managed to do very well here with all sorts of tricks for making the TLB work fast, okay? Good.

So here's an actual previous-generation processor, which is pretty cool. If you look at the front-end instruction fetch and the back-end data path, what you'll see is: here's the data TLB, here's the instruction TLB, and these are first-level TLBs that are backed by a second-level TLB. When you miss in the first-level TLB, you first look in the second level, and it's only when you miss in the second level — which is, by the way, much bigger — that you do your table walk, okay? And so this is a way to make that faster.
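Going back to the overlap trick for a moment, here's a small sketch of the condition it depends on: the cache index and byte-select bits have to fit inside the untranslated page offset, which bounds the cache size by page size times associativity. The helper name and specific numbers are just illustrative.

```c
#include <stdio.h>
#include <stdbool.h>

/* The overlapped-lookup trick works only if the bits used to index the
 * cache (index + byte select) all lie within the page offset, i.e.
 * cache_size / associativity <= page_size.  Then the index can be sent
 * to the cache before the TLB finishes translating. */
bool can_overlap(unsigned cache_size, unsigned ways, unsigned page_size) {
    return (cache_size / ways) <= page_size;
}

int main(void) {
    printf("4K direct-mapped, 4K pages:  %s\n", can_overlap(4096, 1, 4096) ? "ok" : "no");  /* ok */
    printf("8K direct-mapped, 4K pages:  %s\n", can_overlap(8192, 1, 4096) ? "ok" : "no");  /* no */
    printf("8K two-way,       4K pages:  %s\n", can_overlap(8192, 2, 4096) ? "ok" : "no");  /* ok */
    return 0;
}
```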
On that particular processor, for instance, the L1 I-cache is 32 kilobytes, the L1 data cache is 32 kilobytes, the second-level cache is a combined one megabyte, and the third-level cache is actually 1.375 megabytes per core — and you can get a lot of cores on these particular chips. And the TLBs, just to finish out what we've got: the level-one instruction TLB is 128 entries, eight-way set associative; the level-one data TLB is 64 entries, four-way set associative; and all of this is backed by the second-level shared STLB, which is 1,536 entries, 12-way set associative. So you can see how they pull all sorts of tricks to meet their pipeline timings and thereby keep a physically indexed cache, which is much simpler than dealing with a virtual one. Okay, a core, in this case, by the way, is a combination of a processor, a slice of the third-level cache, some cache-consistency hardware, and a bit of networking — that's often called a core — and the processor core is the smaller piece of that; together those are all put on a chip to get multiple processors. So you can either think of a core as a processor, or as a processor plus some extra stuff associated with that processor. Okay, and what's in this figure, by the way, is the processor portion of a core; there's a whole bunch of other stuff as well.

So what happens on a context switch? In general, you've gotta do something. Assuming for the moment that we have physically indexed caches, we don't have to mess with the cache, so that's cool, but we still have to do something about the TLB, and the reason is that we just changed the address space from one process to another, so all those TLB entries are no longer valid — because, gee, for process A it mapped virtual zero to one particular part of the physical space, and you switch to B, and now zero gets mapped to a different part of the physical space, so we've gotta do something. We have a couple of options. One is you can flush out the whole TLB with special flush instructions as soon as you context switch. A lot of earlier processors — and by earlier, I mean as recently as seven years ago — required you to do this. So on every context switch you actually had to flush the TLB, and you can see now why a switch from one process to another might actually be expensive, because you're flushing a bunch of stuff. More modern processors, which Intel has made much more common these days, actually have a process ID in the TLB, so when you switch from one process to another and change an ID register in the processor, it knows automatically to ignore the old TLB entries from the other process and put new ones in. And by the way, when you switch back to process A from process B, it's quite possible that a lot of A's TLB entries are still there. So this is a much better way of sharing the TLB amongst multiple processes, and it has the advantage that you don't have to flush the TLB on a context switch. Now, if the translation tables themselves change — which basically means the page table changes — then you've gotta do something. In that case you really have to invalidate the TLB entries, and I'll show you more about that in a moment. That's because the TLB is a cache on the page table, and if the page table gets changed in software, you've gotta do something about the TLB, which is still in hardware; this is typically called TLB consistency.
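Here's a minimal sketch contrasting the two context-switch strategies just described — a full TLB flush versus tagging entries with an address-space ID. The function names (load_page_table_base, tlb_flush_all, tlb_set_asid) are hypothetical stand-ins for whatever privileged instructions or registers a given architecture provides.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical privileged primitives; real hardware exposes these as
 * special instructions or registers (e.g. reloading the page-table base,
 * a TLB-flush instruction, or an ASID/PCID register). */
extern void load_page_table_base(uint32_t root_frame);
extern void tlb_flush_all(void);
extern void tlb_set_asid(uint16_t asid);

struct process {
    uint32_t page_table_root;   /* physical frame of top-level page table */
    uint16_t asid;              /* address-space ID, if the hardware has one */
};

void context_switch_mmu(struct process *next, bool hw_has_asids) {
    load_page_table_base(next->page_table_root);
    if (hw_has_asids) {
        /* Old entries stay in the TLB but are ignored because they are
         * tagged with a different ASID; switching back later may still hit. */
        tlb_set_asid(next->asid);
    } else {
        /* No ASIDs: every translation cached for the old process is now
         * wrong for the new one, so throw them all away. */
        tlb_flush_all();
    }
}
```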
And of course, with a virtually indexed cache you also have to flush the cache, or you have to have process IDs in the cache, so that's also potentially tricky. All right, now I'm gonna show you this in a second. So let's look at this particular example, and there was a question in the chat about the difference between the TLB and the page table — hopefully this will help a little bit. The TLB is a cache on the page table. But let's look here and put everything together. Our virtual address — and this is gonna be that magic 10-10-12 layout — has the piece in red, the virtual page number, which has two pieces to it, and the offset, which of course just gets copied into the offset of the physical address. That's the easiest part of translating from a virtual to a physical address, so always remember that, okay? That's a way to get yourself some points, right?

So now let's look at what happens here. Physical memory is over in the mint green on the right. We have a page table pointer pointing at the first-level page table. We take the first index, and that lets us get a page table entry. That page table entry includes a physical pointer to the next-level page table, and then we use the second index to look up the entry there, which gives us the physical page number, okay? And now we have our physical address. So what are we gonna do with that physical address? Well, the physical page number is pointing at a page, which — if the offset is 12 bits — is gonna be four kilobytes in size, and then the offset picks a chunk inside that page, and that's the thing we're accessing. I hope everybody sees the analogy here right off the bat between paging and caching, right? Because remember, in the case of a cache, we looked up the cache line, this light blue thing, and then we had the offset to look up the dark blue thing. This is almost the identical idea, except that these offsets are bigger than a typical cache-line offset, okay? And we're gonna bring the cache into this in a moment, so you can see everything in one figure.

Now, this was all fine and dandy, okay, except that we had to walk our way through this page table, which is expensive. So the question that was in the chat — what's the difference between the TLB and the page table? Well, this is the page table. To access the page table, I have to do this lookup: it's a tree of tables, and I have to do a multi-level lookup to get the physical page number, which we do not wanna do on every access, okay? Oh yeah, and the question here is how much of the data at that offset will we use? That depends on the load or store you're doing: it could be 32 bits, it could be 64 bits, it could be a byte, it could be any number of things. Now, we're gonna get to the cache in a moment, in which case we might pull a whole chunk of this into the cache, and in that case it's gonna be a cache-line-sized chunk, okay?

All right, let me pause for a second — I was worried about buffering there. So this is the page table, but this is too slow, because just to do a load we had to go through multiple lookups in a page table, and this can be four to eight levels. That's not good, so I'm graying this out. Now we wanna get from this virtual address to this physical address quickly. How do we do that? We put in a TLB.
So what we're gonna do is remember this translation down here in the TLB, where we take the virtual page number, this red thing, and put it in as the tag of, say, a fully associative cache, and this yellow thing is gonna be the physical page number. Now we've short-circuited the multiple memory accesses: we just take the virtual address and quickly look it up — think of it as a hash table if you want, a fully associative lookup — and that gives us our physical page, and we're quick, okay? So hopefully this shows you that the TLB is a cache on this more complicated page-table lookup, but it doesn't show us the actual cache itself, the data cache. The TLB is a cache; the real cache, though, the data cache, is this one, and remember we talked about it at the beginning of the lecture. So how are we gonna look up the data in the cache itself? Well, this is the physical address that's been translated, and that physical address can be divided up into tag, index, and byte select — everybody remembers that from the beginning of the lecture, right? This is 61C. So that index is used to select a cache line, the tag is checked, and assuming it matches, the byte select decides which part of that cache line we want, okay? In that instance, on a prior access we took a cache-line-sized block out of physical memory here, put it into the cache, and now we go ahead and do the actual access out of the cache, all right? So notice there are two caches here: the TLB and the regular data cache. I wanna pause just long enough to make sure everybody has absorbed that.

So the dark blue on the right here corresponds to — I guess you could say it corresponds to either this dark blue piece here, if I'm looking at only a tiny bit of this physical memory, or you could say it corresponds to this whole cache line. But actually, if you look at this offset as going down to the byte, and there's a particular byte here, then this byte, the dark blue piece, could be the same as this dark blue piece.

So, during a context switch, where do we flush the TLB to? Outer space? It goes into the bit bucket, and the little ones and zeros go draining out of the waste basket in the back of your computer. No — basically, when you flush the TLB, you just throw things out. Because if you think about what's in the TLB — let me back this up a little bit here — this cache is, in some sense, a read-only cache on top of the page table. So you can throw out the whole TLB at any time and still be correct, because there's no information in the TLB that exists only there — except, when we start talking about the clock algorithm, there are things like the use bit and the dirty bit that do have to be maintained to some extent. Okay, all right. Now, where are we storing the tag-plus-index part? It isn't stored; it's just a reinterpretation. I take those 32 bits and mentally divide them up into these three pieces so I can do my cache access. All right, we good? So I thought I'd throw that all together so you saw it all in one place.

Good, now let's move up a level. We've been talking about the page table translating from virtual to physical, but we haven't talked about what happens when there isn't a translation. Okay, and what does that mean? It means that there isn't actually an entry in the page table for every possible mapping.
And therefore some of those entries are marked invalid, which means we're gonna get a fault. We'll try to do the access, we'll get as far as looking at the last-level page table — or maybe the top-level page table — and we'll encounter an invalid bit, and at that point the hardware is like, game over, I can't do anything about this, okay? And that's called a page fault, all right? So what happens there? Well, the page table entry is marked invalid — or in the case of Intel, not present, right? — and at this point that's a problem. Another problem could be that we try to do a write but the page is marked read-only, or some other access violation, or the mapping doesn't exist at all. These are all possibilities. What's gonna happen is we cause a fault or a trap. It's a synchronous fault, which is not an interrupt — interrupts are asynchronous; this is a synchronous fault because a memory access failed — and at that point we have to do something to move forward, okay? In that case, page faults actually engage the operating system to fix the situation and retry the instruction.

Now, good question: what's the difference between a page fault and a segmentation fault? Typically a page fault occurs when you try to access a page and the access isn't currently allowed by the page table, okay? What happens is the operating system gets a page fault, and it can now do something. Just because you get a page fault doesn't mean nothing can be done: you might pull something off of disk, you might do a copy-on-write operation, you might do any number of operations. So a page fault isn't necessarily fatal. A segmentation fault, on the other hand, is often thought of as fatal — when your program dies, it dies with a segmentation fault. The name is historical: it's called a segmentation fault because typically you're accessing outside of a segment, exactly like in memory segmentation. When you try to go outside the segment, if you remember back to a few lectures ago, that can't go on, and that's a segmentation fault, okay? Now, in modern systems you could get either a segmentation fault or a page fault in the kernel, depending on the current situation. You're gonna get the segmentation fault first, because that's the first thing that's checked — is your address within the segment? — and then from there you check the pages, okay? All right, so "segmentation fault" is used either generically to mean game over, kill your program, or specifically to mean a fault because you've violated something about the segments. Let's not dwell on that too much; we'll mention it again later.

Another thing to notice here: this is a fundamental inversion of the hardware-software boundary, because the hardware is trying to move forward and it can't until software runs, okay? That's a little different than we normally think, right? Normally software has to use hardware to run; here the hardware stops and says, I can't do anything, and the software takes over. The thing that's important is that when you get a page fault, that's the hardware saying, I can't do this. You can't then, in the handler, cause other page faults — or at least you can't do it recursively in a way that will never resolve — because in that case you go into an infinite loop, and that's bad, okay? So we have to be a little bit careful about page faults, especially in fault handlers.
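Here's a tiny sketch of the kind of classification a fault handler might do when it looks at the faulting PTE and the access type; the bit positions and the enum are illustrative, not a particular OS's or architecture's exact layout.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed PTE flag positions, in the spirit of the 32-bit entry
 * discussed later in the lecture. */
#define PTE_PRESENT  (1u << 0)
#define PTE_WRITABLE (1u << 1)
#define PTE_USER     (1u << 2)

enum fault_kind {
    FAULT_NOT_PRESENT,     /* maybe demand paging: bring the page in    */
    FAULT_PROTECTION,      /* write to read-only, user touching kernel  */
    FAULT_NONE             /* shouldn't happen: translation was fine    */
};

enum fault_kind classify_fault(uint32_t pte, bool is_write, bool is_user) {
    if (!(pte & PTE_PRESENT))
        return FAULT_NOT_PRESENT;       /* could be on disk, or truly unmapped */
    if (is_write && !(pte & PTE_WRITABLE))
        return FAULT_PROTECTION;        /* could also be copy-on-write */
    if (is_user && !(pte & PTE_USER))
        return FAULT_PROTECTION;
    return FAULT_NONE;
}
```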
So let's look a little further at an idea we can build on this page-fault mechanism, okay? The demand paging idea that I'm gonna talk about next harkens back to the cache. Here's a figure I've shown you many times over the last several lectures, which is the idea that we have many levels of caches in the system, okay? Modern programs require lots of physical memory — memory systems have been growing at 25 to 30 percent per year for a long time — but programs don't use all their memory all the time. This is the so-called 90-10 rule, which says that programs spend 90 percent of their time in 10 percent of their code, or 90 percent of their time with 10 percent of their data. It's not always a perfect statistic, but it's a way to think of things, and it would be wasteful to require that huge amount of storage you're not using — instructions, say, or libraries you've linked but never called — to sit in the fastest memory. So this memory hierarchy is about caching: it's about making the speed of the system look like that of the smallest, fastest level while having the size of the largest level. Now, the largest one I show here is tape — I know that's probably way before your time — but instead of just disk we could have SSD and then spinning storage, and there are still people who use tape in some very rare instances. Tape is potentially much larger than disk, although disks have been getting awfully big, so tape is pretty much a legacy thing now.

But the trick we're gonna pull here is to use main memory, which is smaller than disk in almost all cases, as a cache on the disk. We're gonna think of the image of the process as living, large, on disk, and we're gonna pull only the parts of the process we need into physical DRAM. We're gonna try to make it look like, even though our image is huge, we get the speed of the small thing with the size of the big thing — the disk, for instance — and we're gonna do that by using page faults cleverly. We're gonna start with all of the pages invalid, or only a small number valid, for a process, and then as we start executing we'll get a page fault for a page that's not currently in memory, pull it off of disk into memory, mark the page table entry valid, and keep going. If we do this correctly, we'll eventually get the working set of the process — and I'll show you what that means in a moment — into memory, so that only the things that are actively being used take up space in DRAM. Okay, we call that demand paging.

By the way, caching is called different things depending on who does it. When you hear "caching" typically, and especially in 61C, you were talking about taking the DRAM and using the first, second, third, whatever levels of cache as a higher-speed version of the DRAM, and all of that's done in hardware. We are now gonna do the same idea, where disk is our backing store and the operating system pulls in the parts of the disk that we need, puts them into DRAM, and marks the page tables appropriately, so we get the same idea of caching, but done in software rather than the hardware caching you usually hear about, okay. So let me just show you the idea here. We're executing an instruction, we get a virtual address, and we go to the MMU.
In the good case, we look up the page in the page table — and hopefully this is in the TLB so it's fast, although that's not relevant for the current discussion — this all works, we go to memory, and we access that instruction. That's good, because the instruction we want is already in DRAM. Now, however, it's possible we go to do this and it's not there, all right? In that instance, we get a page fault: we try to look it up in the page table, it comes back invalid, we get a page fault, and we enter the operating system with an exception, okay? In that case we have to deal with the page fault handler. What does the page fault handler do? Well, you're all very familiar now with what happens when we do a system call into the operating system; it's the same idea with the page fault handler: it's gonna be running on the kernel stack, it's gonna identify where on disk that page is, it's gonna start loading the page from disk into DRAM, and it's gonna update the page table entry to point at that new chunk. Then later we run the scheduler, put this process back on the ready queue, and it retries the instruction, which this time will work, and we win, okay? So this is demand paging, all right. And notice, by the way, that I sped this up a little bit: obviously, when we start the million-instruction-long load off of disk into memory, we have to put that process on some wait queue, and then when the data is actually in memory, we wake it up from the wait queue; that's the point at which we can fix the page table and then put the process back on the ready queue so that it can run, okay. All right, so this is called demand paging. Questions?

Now let's look a little further. Demand paging is a type of cache, so we can start asking our standard cache questions. And by the way, why is this a cache? Look: when we missed originally, the page wasn't in memory — that was a cache miss — and then we put it in, and now we get cache hits, right? So this is just like a cache; it's just being done in software, and we're pulling things in not in cache-line-sized chunks but in pages off the disk. So that's our first question: what's the block size? One page, which is now four kilobytes, not 32 bytes or 128 bytes, okay. What organization do we have here? Well, this is interesting: I hope you can all see this as a fully associative cache, because we have an arbitrary mapping from virtual to physical thanks to the page table. The page table gives us a fully associative cache, because we can put a page pretty much anywhere we want in memory, okay. How do we locate a page? We first check the TLB, then we do a page table traversal, and hopefully we've found it; if we still fail after all of that, then we might do something more drastic, like kill off the process. Earlier, for a cache, we could think about replacing randomly or using LRU. So what's the replacement policy here? This is actually gonna require a much more interesting, longer discussion, which we're gonna start tonight and then continue next time in more detail. Is it LRU? Is it random? It turns out that the cost of a page fault that has to go to disk is high, right — a million instructions' worth of time — so when our DRAM is full of other pages, we have to be very careful about which one we choose to replace.
And so for the replacement policy, we can't just say random works pretty well or LRU works pretty well; we have to do something else. It turns out we're gonna want something like LRU, but we're not gonna be able to do it exactly, and we'll show you that in a little bit, maybe next time. So what happens on a miss? Well, on a miss you go to the lower level to fill the miss, which here means you go to disk. What happens on a write? Well, here's a good example. Earlier I talked about write-through versus write-back, and maybe at the top-level cache you do write-through to the second-level cache. Here you absolutely cannot do write-through. Okay, why can't we do write-through for paging? Anybody guess? Yep, because that would mean a write from the processor, which is supposed to be really fast, takes a million instructions' worth of time. So absolutely not, no write-through to disk. What we're gonna do instead is write-back, and that means we're gonna have to keep track, possibly in the page table entry, of which pages have been written to, so that we know which ones are dirty and have to be written back to disk. They can't just be thrown out; they actually represent up-to-date data. So now we wanna provide the illusion of essentially infinite memory. On a 32-bit machine that would be four gigabytes, and on a 64-bit machine that'd be exabytes' worth of storage. Okay, and we wanna do that by using the TLB and the page table with a smaller physical memory. So I'm showing you here a four-gigabyte virtual memory and a smaller 512-megabyte physical memory, which holds only the data that's actually resident and has to be shared amongst a bunch of different processes. Okay, and so the page table is gonna be our way of giving that illusion of an infinite, or at least a full-address-space-sized, amount of memory. Okay, and the way that works is that the mapping through the TLB and the page table is gonna say that certain items are in physical memory while others are not; they're on disk. And it's gonna be up to the page table to help us with that mapping. Okay, now the simplest thing it needs to do is a really quick mapping for those things that are actually in DRAM; we talked about that earlier. For the things that aren't in DRAM, there's more interesting flexibility here. One option is that you put something about which disk block it is into the page table entry, which you're mostly not using otherwise. The other would be a special data structure in the operating system. These are all options for locating the page on disk once you've missed in the TLB and page table combination. Okay, so disk being larger than physical memory means that the virtual memory in use can be much bigger than physical memory, and the combined memory of the running processes can be much larger than physical memory. More programs fit into memory, so more concurrency. And so the principle here is a transparent level of indirection, the page table, supporting flexible placement of the actual physical data and the choice of which things we wanna have on disk or not. Okay, and that's what we're going to try to figure out how to manage now. So up till now we've talked about all the mechanisms, TLBs, page tables, et cetera, to make this work, and now we need the next level, which is: what does the operating system do to manage all those pages well? So remember the PTE; I showed you this one earlier, from our magic 10-10-12 example.
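In fact, let me put the layout down first as a handful of C-style constants, the way you might see them in a kernel header. The bit positions follow the classic Intel 32-bit format, including the same present bit we used in the sketches above, but the names here are just my shorthand, not lifted from any particular source:

    #define PTE_P    (1u << 0)   /* present/valid: the page is in physical memory */
    #define PTE_W    (1u << 1)   /* writable                                      */
    #define PTE_U    (1u << 2)   /* user-accessible (vs. kernel-only)             */
    #define PTE_PWT  (1u << 3)   /* write-through caching for this page           */
    #define PTE_PCD  (1u << 4)   /* caching disabled for this page                */
    #define PTE_A    (1u << 5)   /* accessed: set by hardware on any reference    */
    #define PTE_D    (1u << 6)   /* dirty: set by hardware on a write             */
    #define PTE_PFN(pte)  ((pte) & 0xFFFFF000u)   /* top 20 bits: page frame number */

Okay, so walking through that: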
Here's the 32-bit page table entry: the top 20 bits are the physical page frame number, either of the page itself or of the next level of the page table. And if you look down in here, you see that the valid or present bit is down at the bottom. We have whether it's a writable page, whether it's a kernel or user page, whether it's cacheable or not. And as we move our way up, we have some interesting things like the D bit, which says: is it dirty or not? The D bit gets set in the page table entry when we modify the page, okay? So when we're about to replace the page, the D bit tells us whether we need to write it back to disk or not. The A bit, or access bit, is another one; we're gonna use that for the clock algorithm, and we'll talk about that in a bit. So, some mechanisms for demand paging. The page table entry gives us the options we need to do demand paging. Valid means the page is in memory and the page table points at the physical page; not valid, or not present, means the page is not in memory. And at least in the Intel spec, when that present bit down at the bottom is zero, you're welcome to use the remaining 31 bits for anything you want. One thing you could use them for: the operating system could use them, for instance, to identify a disk block, or to point at something else in internal data structures that it happens to have in memory in the kernel. If the user references a page with an invalid page table entry, the memory management unit is gonna trap to the operating system, giving you a page fault. What does the OS do? Any number of things. It chooses an old page to replace, for instance. If that page is modified, so D is one, it's gotta write the contents back to the disk. It's gonna change the page table entry, and the cached TLB entry, to be invalid. So notice what we're doing here: we're picking a page to kick out, maybe writing it back to disk. At that point, that page goes from valid to invalid, so we have to go and modify its page table entry in the page table to be invalid, and since the TLB is a cache on that, we have to invalidate the TLB entry. Otherwise we end up with the TLB giving us the wrong answer: it's gonna claim that a page is valid when it's been overwritten. And then we load the new page into memory from the disk. So we caused this page fault because we were trying to access something that wasn't there; we pull it in off of disk, put it into memory, overwriting the one we got rid of. We update the page table entry for that new page to be valid and point at the physical frame we took. We invalidate the TLB for the new entry; why? Well, because anything the TLB has for that page is from when we got the page fault, when the mapping was still invalid. And then finally we continue the thread from the original faulting location and we're good to go. And so all of these steps are basically what makes the thing we're talking about here a cache; this is how the combination of TLB and page table entries can be turned into a demand-paging caching mechanism. When the thread goes to execute again after it's pulled off the ready queue, the TLB will get reloaded at that time, because since we invalidated it up here, the first thing that happens is you get a miss in the TLB, you walk the page table, and you get the new entry. Okay, so now there's a good question in the chat: when the program that was referencing that old page, the one we kicked out, starts running again, what happens?
It causes a page fault, and it picks a different page to replace and pulls its own page back in off the disk. Okay, so the crucial thing here is to not have so much memory in use that the only thing you're ever doing is pulling pages off of disk; that's called thrashing. If you get to that point, there's no actual work happening, only paging, okay, and that's the worst possible scenario. Assuming we're not at that point, all that we've done by pushing out a good page, and that's where replacement comes into play, is readjust the working sets of the running processes so that the things actually in DRAM are the things actually being used, and we get really fast access to all those pages. The hope is that very rarely do we kick a page out of memory that's in active use by a process. Okay, so hopefully the case being worried about in the chat is not too frequent; otherwise we're in trouble. And of course, while we're pulling pages off of disk and so on, we wanna run something else, because we've got a million instructions' worth of time to wait. All right, now, where does the sleep happen? The sleep is gonna happen on the disk wait queue. Okay, so for the process that's trying to page in off of disk, the TCB for that thread is gonna be placed on the wait queue for that disk, and it will get woken up when the block comes back from the disk. And when does the sleep happen? After you start everything in motion and get the access started on the disk. At that point, when it's all up to the disk, that's the point at which you put that thread on the wait queue, all right, and then it'll get pulled off the wait queue when the data comes back. So now, the origins of paging here are pretty clear. This is back in the really, really old days, where you had a really expensive piece of equipment, many clients running on dumb terminals, a small amount of memory shared by many processes, and disks holding most of the storage. In that scenario, what you wanna do is keep most of the address space on disk and try to keep the memory full of only those things that are actually needed, because memory is incredibly precious. Okay, so a lot of the original paging mechanisms came out of that environment, and you're actively swapping pages to and from the disk. Today, we're very different, right? We have huge memories. We have machines that, rather than being owned by one organization and used by many people, are typically owned by one user and running many processes on behalf of that user. And so, just like when we were talking about all the different ways of scheduling, where part of those differences reflected the changing needs of resource multiplexing, the same thing is true of paging, okay. If you take a look, for instance, at a ps aux style listing you might get off of a Unix-style operating system, what you'll see here is the memory. If you look at physical memory, we've got about 75% of it in use, with about 25% kept for dynamic needs, and a lot of it is shared; there's about 1.9 gigabytes shared, mostly in shared libraries, which you can see up here. And so really there's a lot that's working on behalf of, kind of, one user. So we're not so much optimizing by trying to push things that aren't being used out to disk as quickly as possible; we're trying to keep things in, and we have a lot more memory to work with.
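By the way, just to put a rough number on why those faults have to stay rare even on a machine with lots of memory, here's a little back-of-the-envelope calculation. The numbers are purely illustrative, not measured from any real system:

    #include <stdio.h>

    int main(void)
    {
        double mem_ns   = 100.0;         /* assumed DRAM access time               */
        double fault_ns = 8e6;           /* assumed cost of a fault to disk, ~8 ms */
        double p        = 1.0 / 10000;   /* one fault per 10,000 accesses          */

        /* Effective access time = (1 - p) * fast case + p * slow case */
        double eat = (1.0 - p) * mem_ns + p * fault_ns;
        printf("effective access time = %.0f ns\n", eat);   /* about 900 ns */
        return 0;
    }

So even a one-in-ten-thousand fault rate makes memory look roughly nine times slower under these assumptions, and that's why the replacement policy matters so much.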
And so we wanna keep that in mind when we start talking about policies for paging. Now, there are many uses for virtual memory and demand paging, and we've already talked about several of them. You can extend the stack: you allocate a page and zero it. You can extend the heap the same way. In process fork, you can share the page table by copying it and setting all the entries to read-only, and then only when somebody modifies a page do you actually copy it, so fork is a lot cheaper because of the page table. We've talked about exec, where you bring in only the parts of the binary that are actively in use, and you do this on demand, okay? We haven't talked about mmap yet, but we'll use that for explicitly mapping regions, to access a file almost as if it were memory, or to access shared memory almost as a file; we'll talk about that in a little bit. So let me just show you, if you'll bear with me a little bit, and then we'll finish up; maybe we won't plow into too many new mechanisms after this. Classically, you took an executable, you put it into memory, and you did this for a small number of processes. What we do today instead is: we take a huge disk, we've got our executable on it, which has code and data and is divided up into segments, and we wanna map it into a process's virtual address space. And that's gonna happen because we're gonna map some of the physical pages off the disk into memory, okay? So if you take a look, for instance, when we start up that process, we have a swap space, which is potentially set up to represent the memory image of this process, and notice that that memory image on disk mirrors what we have in our virtual address space, okay? The page table points at the things that are actually in memory, for instance the dark blue pages here, okay? And for all the other pages, the operating system has to keep track of where on disk they are, so that if the user tries to use parts of the virtual address space that aren't currently in memory, it knows how to pull them in, okay? And as I mentioned, you could use the page table entry partially for that, if you like, or you could use a hash table in the operating system. So here's an example of the page table entry with those extra 31 bits pointing back to where these pages are on disk; they could actually store disk block IDs or something like that, okay? Now, what data structure maps non-resident pages to disk? Well, in Linux, for instance, you can think of it as a find_block type of function, which takes the process ID and the page number you're looking for and gives you back a disk block, and that mapping can be implemented in lots of different ways. As I said, you could store it in memory, you could use a hash table, like an inverted page table, just for that data structure. And you can map code segments directly to the on-disk image, so that, for instance, when I start this thing up, I don't even have to load the code into memory. What I do is point the virtual address space entries for that process directly at the disk image, and then as soon as I start page faulting on them, they get copied into memory from disk, and I practically don't have to do anything at startup, okay? And I can use this to share code segments among multiple instances of a program across different processes. And so I wanna show you an example of that.
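Before the sharing example, let me sketch one way that find_block style mapping might look if you built it as a little hash table in the kernel, keyed by process and virtual page number. This is purely illustrative; the names and the fixed-size table are my own, and a real kernel keeps this information in fancier per-region structures:

    #include <stdint.h>

    struct vtod_entry {                    /* "virtual to disk" mapping entry      */
        int                pid;            /* which process                        */
        uint32_t           vpn;            /* virtual page number                  */
        uint32_t           disk_block;     /* where that page lives in swap space  */
        struct vtod_entry *next;           /* chaining for collisions              */
    };

    #define VTOD_BUCKETS 1024
    static struct vtod_entry *vtod_table[VTOD_BUCKETS];

    static unsigned vtod_hash(int pid, uint32_t vpn)
    {
        return ((unsigned)pid * 31u + vpn) % VTOD_BUCKETS;
    }

    /* Returns the disk block holding (pid, vpn), or -1 if we have no record;
     * this is the lookup that the page fault handler sketch earlier relied on. */
    long find_disk_block(int pid, uint32_t vpn)
    {
        for (struct vtod_entry *e = vtod_table[vtod_hash(pid, vpn)]; e; e = e->next)
            if (e->pid == pid && e->vpn == vpn)
                return (long)e->disk_block;
        return -1;
    }

Okay, now on to the sharing picture.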
Here's an example where the first virtual address space is the dark blue one. It's got some code, which is gonna be shared, that's the cyan color, and then there's gonna be a green process, which I'll show you in a moment, okay? Here's the green one. It's got its own image on disk, and if you notice, it's got pointers back to where its things are on disk, okay? And the code here is pointing to the same disk blocks, which can actually be part of the original linked image, the a.out that you stored on disk. Both processes can link back to that, okay? And that way they both end up having their code point at the same part of memory. That's much more efficient, since they're the same program running in different processes, all right? Here's an amusing thing to notice. So the active process and page table here might be the blue one, and what happens when we try to execute an instruction or touch some data that causes a page fault? Well, we go to the page table entry, the data's not there, we cause a page fault, we figure out where the data is, and we start the load happening, okay? That load is started on the disk controller; we'll talk about that in a couple of lectures. Meanwhile, we switch the active process, we point it at the other guy, and he runs, and eventually the data comes in, okay? It gets mapped into the page table entry, we restart the first process, and we're good to go. Okay, so at that point we allowed the light green one to run for a while, the dark one was put to sleep, and then it got woken up when the data came back from disk. So that's kind of showing you all of these little pieces we've been talking about, all put together in one big picture. Are there any questions on that? Hopefully this helps a little bit. So, the steps in handling a page fault, just to show you this one last way: our program is running, it causes a reference that's looked up in the page table, which traps with a page fault. At that point we figure out where the page is, it's on the backing store on the disk, and we start the process of bringing it into a free frame. How did we get a free frame? Well, we replaced a page, or, ultimately, we're gonna be more sophisticated and keep a free list of free frames. We bring the page in, and after it's brought in, we fix the page table entry and we restart, okay? Where's the TLB in all of this? Can anybody give me a good reason why I haven't shown the TLB in the last several slides? You could say this is all with physical addresses, and that's true, but the real reason I haven't shown you the TLB is that the TLB is just a cache on the page table; it's making the page table faster. Just like I'm not showing you the hardware cache in all of this either; the cache is just making the DRAM faster, all right? So you can think of the TLB as part of the page table that makes everything faster, and when we talk about the hardware mechanisms, then we have to talk about the TLB. But here, at this level of abstraction, we've pulled up a little bit, and we don't even have to worry quite yet about the TLB. There are some cases, which I mentioned on the slide where I showed you why this is like caching, where we have to remember to invalidate the TLB to keep that caching illusion alive, okay? All right, so there are some questions we're gonna need to answer, and we're gonna do that next time.
During a page fault, where does the OS get a free frame? Well, it might keep a free list; there might be a reaper that's busy pushing pages out to disk and keeping that free list stocked; and maybe, as the last resort, we have to evict a dirty page and write it out first. There are gonna be some interesting policies there. How do we organize these mechanisms? We're gonna have to work on the replacement policy; that's gonna be our major topic next time. How many page frames per process? That's another question: how much of that precious physical DRAM do we give to each process? All right, so to finish up: we talked about caching, and we finished that up. The principle of locality, temporal and spatial, is really what's behind regular caches, and it's also present in the TLB; the question we asked was whether there's enough page locality for the TLB to work as a cache. We talked about the three-plus-one major categories of cache misses: compulsory, capacity, conflict, and coherence. We also talked about direct-mapped, set-associative, and fully associative organizations; hopefully that reminded you of 61C. We talked about how to organize the translation lookaside buffer, we talked about its organizations, why it's typically fully associative, and what to do on a miss. And next time we're gonna start talking about replacement policies, and we'll start with some idealized ones like FIFO and MIN, and LRU, it turns out, is also an idealized one, for reasons we'll talk about next time. All right, I've gone way over, so I'm gonna let you guys go. Hope you have a great evening and we'll see you on Wednesday.