All right, welcome to another great day of operating systems. Whoops, I have to unmute myself. Thanks. Today we'll be talking about multi-level page tables. And sorry about yesterday: my laptop bricked itself when it decided to update to the latest macOS version. This is why operating systems are important, because I got an error message that no one's ever seen before when I googled it, and I had to reinstall my operating system. So I missed Tuesday, and to keep the lectures even, I figured I may as well cancel Wednesday too.

We're about at the midpoint of the course, and since I'm new, you can help shape the course and make it better, so please provide some feedback. I put up a form: when you download the lecture notes, hit the link and fill it out. That would be very, very helpful.

What we left off on was multi-level page tables, which save space for sparse allocations. Even with a 39-bit virtual address and the normal 8-byte page table entry size, a single-level page table would be 2^27 entries times 8 bytes, which is one gigabyte for every process. That doesn't really make sense, so we need to do something better, and multi-level page tables are a way to do that.

The trick to multi-level page tables is that each smaller page table should occupy exactly one page. In this case, if our page table entry size is 8 bytes and we have a 4 KiB page size, then 2^12 divided by 2^3 is 2^9, so 512 page table entries fit on a single page. The idea is that a page table entry at a higher level, like L1, points to the page table one level below it. It can store just a physical page number, because everything is aligned to a page: each smaller page table exactly fits on a page, so you don't have to compute byte offsets or anything like that; the index bits tell you where all the entries are located. You follow these page tables all the way down until you hit L0, and the L0 entry contains the physical page number, giving the same translation as with a single-level page table.

So let's consider just one additional level: an L1 and an L0 page table, two page tables total, and a 30-bit virtual address instead of 39 so that there are only two levels. If we write out the virtual address 0x3FFFF008 in binary, it splits into the two index fields and the offset. Again, a 30-bit virtual address, and the page size is exactly the same. If we only had one single page table, it would have to be two megabytes: of the 30 bits, 12 are the offset within a page, so we have 2^18 virtual pages, and to map them a single-level page table needs 2^18 times 2^3 bytes, which is two megabytes. The idea is that instead of one big two-megabyte page table, if we only need to map a single page, or very few pages, we can instead have one page table for L1 and one for L0. And again, remember each fits on a page, so they're 4 KiB apiece.
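To put numbers on that comparison, here's the arithmetic from above written out:

```latex
\underbrace{2^{30-12} \cdot 2^{3}}_{\text{single-level table}} = 2^{21}\,\mathrm{B} = 2\,\mathrm{MiB}
\qquad\text{vs.}\qquad
\underbrace{2 \cdot 4\,\mathrm{KiB}}_{\text{one L1 + one L0}} = 8\,\mathrm{KiB}
```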
So you'd have one L1 page table and one L0 page table, each 4 KiB, for a total of 8 KiB instead of two megabytes. Now, this scheme is better if there are few allocations. If you allocate all of memory and need to map everything, you'd need a lot of L0 page tables, 2^9 of them in fact, so two megabytes of L0 page tables plus the 4 KiB L1 page table. So it would actually use 4 KiB more than a single-level page table if we had to record entries for everything.

Here's how the translation would work. Let me switch to my tablet quick. Oops, key binds don't work anymore. All right, let's go through this. Again, we're considering a 30-bit virtual address with a page size of 4 KiB, 4,096 bytes. For a single-level page table we would need two megabytes: the number of virtual pages, 2^18, times the page table entry size of eight bytes, which is 2^21. Any questions about the single-level page table? Okay.

Now, with two levels of page tables, I wrote out that 0x3FFFF008 address and separated out the indices and the offset. The L1 index is 111111111 (511), the L0 index is also 111111111 (511), and then there's the 12-bit offset, 0x008, which we know doesn't change.

To start off, we need a level one page table, and something has to point to it. Let's assume our page table pointer points at 0x7000, and again, this is aligned at a page: everything starts at a page boundary and all our page tables consume exactly one page. So if our L1 page table is at address 0x7000, there's a table there with 512 entries, because remember we're fitting everything in a page. It's 2^12 divided by our 2^3 page table entry size, which means there are 2^9 entries.

All right, who wants to review last time? What's in the page table entry? What are some fun things in there? Or, what are our page tables even doing? Yeah, they're mapping virtual addresses to physical addresses. So a page table entry has the physical page number, that's the mapping, and then a bunch of other permission bits: one to say if it's valid, maybe one for read, maybe one for write, maybe a bit to say whether it's executable, things like that. For the purposes of translating an address we can just assume it holds the physical page number.

So our L1 page table at address 0x7000 has 512 entries, indexed 0, 1, 2, ..., 511. We don't have to store the index, because it's just the position in the table, and for translating we only care about the physical page number in each entry. Now, before, with a single level, we used the physical page number directly as the translation. Because we have two levels, this is an L1 page table, so its physical page number is going to be the address of the L0 page table.
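Here's what that walk looks like in code. This is a minimal sketch of a software page walk with the parameters from this example (9 index bits per level, 12 offset bits); the function names and the identity mapping of physical memory are my own assumptions for illustration, not a real MMU or kernel interface:

```c
#include <stdint.h>

#define PAGE_SHIFT 12                  /* 4 KiB pages */
#define INDEX_BITS 9                   /* 512 entries per table */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

/* Pretend a PTE is just a physical page number; real PTEs also
 * carry valid/read/write/execute permission bits. */
typedef uint64_t pte_t;

/* Sketch-only assumption: physical memory is identity-mapped, so we
 * can read a page table directly through its physical address. */
static pte_t *phys_to_table(uint64_t pa)
{
    return (pte_t *)(uintptr_t)pa;
}

uint64_t translate(uint64_t l1_table_pa, uint64_t vaddr)
{
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    uint64_t l1_idx = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
    uint64_t l0_idx = (vaddr >> PAGE_SHIFT) & INDEX_MASK;

    /* The L1 entry holds the physical page number of the L0 table. */
    pte_t *l1 = phys_to_table(l1_table_pa);
    uint64_t l0_table_pa = l1[l1_idx] << PAGE_SHIFT;

    /* The L0 entry holds the physical page number of the final page. */
    pte_t *l0 = phys_to_table(l0_table_pa);
    uint64_t ppn = l0[l0_idx];

    return (ppn << PAGE_SHIFT) | offset;
}
```

Both index fields of 0x3FFFF008 come out to 511, so this walk reads entry 511 of the L1 table and then entry 511 of whichever L0 table it points to; we'll chase the concrete values next.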
So, in this case, let's assume the physical page number here is 8. Physical page 8 means that somewhere else in memory there's now an L0 page table at address 0x8000, because it's all aligned on a page; that's our fun trick here. That table, again, has 512 entries, indexed 0 through 511. And if the physical page number in the entry we index is something like 0xCAFE, what would our final physical address be? Yeah, someone answered in the live chat, yep. This is an L0 page table, so it's exactly the same as our single-level page table: we take the physical page number, plug it in, and keep the offset the same. This up here was our original address, the virtual address, so our physical address is going to be 0xCAFE008. All right, any questions about that?

So instead of one gigantic table, we do the old computer science trick and solve the problem by adding another layer of indirection: we have one table point to a smaller page table, which points to another page table, and we only allocate as many page tables as we need. No questions at all? Okay.

That first, highest-level page table is your root page table. In the 30-bit case there's only going to be one L1 page table, and if we go back to our 39-bit virtual address with three levels of page tables, there's only one L2 page table. Each process just uses a register for it: on RISC-V, at least, there's a register called satp that basically holds the root page table. You'd set that to your L2 page table, then go through and follow the page tables until you finally hit L0, and then it's the same translation as with a single-level page table. So you'd have one L2 page table, which fits exactly on a page, so 512 entries, 0 to 511. Each of the physical page numbers there can point to another page table, so you can have up to 512 L1 page tables, and each of those has another 512 entries pointing to different L0 page tables. Once you hit L0, that's when you do your direct mapping right there. Any questions about that? Nice and fun, nice and efficient.

Okay, so then the question might be: how do you actually allocate memory? The kernel only cares about allocating memory in pages. It doesn't care, like malloc does, about giving you a specific number of bytes; it just cares about pages, and it does it using a free list, which is the easiest implementation you can have. Given a set of physical pages, which is your entire memory space, the operating system maintains a free list that is just a simple linked list, initialized with everything free. The unused pages themselves contain a pointer to the next free page: as part of the initialization process, at boot, your kernel goes through every page and writes a pointer to the next free one. Because the page is free, you can use its contents for whatever you want. So it goes through and writes an address into every single page.
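Here's a minimal sketch of that free list, in the spirit of xv6's allocator; the function names and the idea of fixed physical bounds are assumptions for illustration:

```c
#define PAGE_SIZE 4096

/* Each free page stores the pointer to the next free page inside itself. */
struct run {
    struct run *next;
};

static struct run *freelist;

/* At boot: walk the usable physical range and chain every page onto
 * the free list. pa_start and pa_end are hypothetical bounds. */
void free_range(char *pa_start, char *pa_end)
{
    for (char *p = pa_start; p + PAGE_SIZE <= pa_end; p += PAGE_SIZE) {
        struct run *r = (struct run *)p;
        r->next = freelist;   /* the free page itself holds the link */
        freelist = r;
    }
}

/* Allocate one page: pop the head of the list (NULL if out of memory). */
void *kalloc(void)
{
    struct run *r = freelist;
    if (r)
        freelist = r->next;
    return r;
}

/* Free one page: push it back onto the head of the list. */
void kfree(void *pa)
{
    struct run *r = (struct run *)pa;
    r->next = freelist;
    freelist = r;
}
```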
And then once you have that free list of pointers set up, it's actually fairly easy to allocate and deallocate memory. To allocate a page, you just remove it from the free list, set up the mapping in your page table, and carry on. To deallocate a page, you add it back to the free list: the freed page becomes the new head of the list and points at the old head. That way you can also reuse memory. So our linked-lists learning has not been in vain, because that's the simplest way to allocate memory: everything is in pages, and if something wants a page you pop it off the list and map it; to deallocate, you just push it back onto the list.

Now, you might think this is painfully slow. If you have to walk the page tables for every single memory access, that seems really, really slow, because you have to follow pointers across multiple levels of page tables. For instance, say you're accessing some virtual address like 0xFFFF-whatever and there are three levels of page tables. Instead of just dereferencing that address with one memory read, you first have to access the L2 page table, which is a memory access, to figure out where L1 is; then access L1 to find L0; then access L0; and only then do you finally have the actual physical page and can do the memory access. You're essentially turning one memory access that should be nice and simple into four, because we're using page tables.

This seems really bad, but you're likely to access the same page multiple times: if you're using the same variable, or your program is executing sequentially, the next thing you touch is probably very close by and on the same page. So if I've already done the translation, I shouldn't need to redo it when I access the same variable again, or another closely related one. And a process might only need a few virtual-page-number-to-physical-page-number mappings at any single time. So this is another classic solution: instead of doing the translation every single time, we just cache it.

A TLB, or translation lookaside buffer, is exactly that: a cache for virtual address translations. Your CPU takes the virtual address, sometimes called a logical address, and looks it up. At the top of the diagram is your cache of virtual page numbers: if there is a miss, meaning it's not in the cache, it takes the path at the bottom, walks the page table to do the translation and figure out where the actual physical address is, and then puts it into the cache. If that page number is already in the cache, it just reuses it directly, you get your physical address, and you're done at that point. So you look it up in the cache, and if it's there, everything's nice and fast; otherwise you have to go through the page table and figure out where that address is. And this is where we get to one of our goals: making this as fast as possible.
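As a mental model, the TLB is a tiny map from virtual page number to physical page number that gets consulted before any page walk. Here's an illustrative sketch; real TLBs are small set-associative hardware arrays, and page_walk here is just a stand-in for the multi-level walk we sketched earlier:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t ppn;   /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the full multi-level page table walk; identity
 * translation here just so the sketch is complete. */
static uint64_t page_walk(uint64_t vpn)
{
    return vpn;
}

uint64_t tlb_translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped */

    if (!e->valid || e->vpn != vpn) {   /* miss: walk the tables, fill */
        e->vpn   = vpn;
        e->ppn   = page_walk(vpn);
        e->valid = true;
    }
    /* Hit path: no page walk at all, just this. */
    return (e->ppn << PAGE_SHIFT) | offset;
}
```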
So, if we just have page tables and do the translation every single time, it's going to be very, very slow. But if we have a cache, and we know the ratio of hits in the cache and how fast the cache is, we can actually reason about our effective access time, the average time for a memory access. This is the best-case scenario, assuming only a single-level page table, where there's only one additional memory access to look up the entry in the page table. In this case, if we have a hit in our cache, the total time is the time it takes to look it up in the cache plus the time of the actual physical memory access. And if it's not in our cache, we still spend the time searching for it, plus we have two memory accesses: one to look up the entry in the page table, and then the actual memory access itself.

So, can anyone tell me what my miss time would be if I have a three-level page table? Six? Three? Yeah, there are three tables, so: with three levels of page tables, you have one memory access for each level of page table, which is three, plus the original memory access. So it'd be four. The formula here is for the case with a single page table, so it's the original memory access plus one page table access; with three levels, it would be the search time plus three memory accesses to walk the page tables and then the original memory access. You can see how this gets even worse the more levels we have: with four levels and a massive table, a miss would be five memory accesses, so each miss gets more costly.

The effective access time is just the hit ratio times how long it takes to resolve a hit, plus the miss ratio times how long it takes to resolve a miss. Some typical numbers: an 80% hit rate, maybe 10 nanoseconds to actually search your cache, and memory accesses taking 100 nanoseconds by default. If you calculate the effective access time, 80% of the time an access takes 110 nanoseconds, the search time plus the original memory access; and 20% of the time it takes the 10 nanoseconds plus two times 100 nanoseconds for the two memory accesses. So our effective access time is 130 nanoseconds, compared to 100 nanoseconds if we were just accessing physical memory by itself. We essentially pay about a 30% overhead for using virtual addresses in this case, which isn't too bad, but it is not free. It's much better than the alternative, though: if we had no caches at all and a three-level page table, every memory access would resolve through the page table and everything would be four times as slow by default, and that would be just a gigantic mess.
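Written out as a formula, with α the hit ratio, t_s the cache search time, t_m the memory access time, and L levels of page tables:

```latex
\mathrm{EAT} = \alpha\,(t_s + t_m) \;+\; (1-\alpha)\,\bigl(t_s + (L+1)\,t_m\bigr)
```

Plugging in the numbers above (single-level table, so L = 1): 0.8 × (10 + 100) + 0.2 × (10 + 2 × 100) = 88 + 42 = 130 ns.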
So then you might ask yourself: if I have this big cache, how does that work with context switching? Each process has its own address space, which means it has its own page table, which means its mapping of virtual addresses to physical addresses is unique. So you can't reuse the same TLB contents between processes when you context switch, because it would still hold cache entries for the process you just switched away from. If you didn't do something about that, it would be really, really bad: any memory access that happened to use the same virtual address would just hit the cache and resolve to the previous process's physical pages, which would essentially leak that memory and let the new process read and change it. All sorts of bad things could happen.

One thing you can do about it is just flush the cache: you invalidate it whenever you switch processes, which is going to be a bit slow, since the new process won't have anything valid in the cache at all. Or, because the mappings are consistent for a single process, hardware designers, if they're nice enough, can also put in essentially a column for the process ID (an address space identifier) so the TLB knows which process each virtual address is valid for, and then you don't have to flush the cache. If another process tries to access the same virtual address, its ID won't line up and it goes through the normal translation. But the easiest and safest thing to do, and what most implementations do, is just flush the cache. On RISC-V, there's an sfence.vma instruction that explicitly flushes the TLB. On x86, if you change the register that points to the page table you're going to use, by default it just flushes the TLB for you, and you don't have any say in the matter; it happens automatically.

So, another fun question is: how many levels of page tables do I need? A question like this will always tell you the virtual address size and the page size. In this case, it's a little different from what we saw before: a 32-bit virtual address with a page size of 4,096. This is like old, pre-2000 systems, where things were actually 32-bit, and because it's only a 32-bit machine, the page table entry size is likely only four bytes instead of eight.

For all the multi-level page table questions, the trick is that you want each page table to fit exactly on a page. So first we find the number of page table entries we can have in a page. In this case it's 2^10, because our page size is 2^12 and our page table entry size is now 2^2, not 2^3 anymore. So we can have 2^10 entries in a single page. If we take log2 of the number of page table entries per page, that 2^10, we get 10, which is the number of bits you need to index into one page table. To calculate the number of levels you need, take the number of bits in your virtual address, minus the bits used for the offset (which is log2 of your page size), divide by the number of index bits per level, and round up; that's what the ceiling operator is. So for this question, we have 32 minus 12, which is 20, divided by 10. In this case, luckily, it is nicely divisible, and that is equal to two.
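As a formula, with v virtual address bits, o offset bits (log2 of the page size), and i index bits per level:

```latex
\text{levels} = \left\lceil \frac{v - o}{i} \right\rceil
\qquad\text{here:}\quad \left\lceil \frac{32 - 12}{10} \right\rceil = 2
```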
So, on these old systems, you could get away with having only two levels of page tables, because the page table entry size was smaller and there were fewer virtual address bits. Now we only have nine index bits per level instead of ten, and more address bits, so we have more levels. And because this is a round-up: if for some reason the question said we have a 33-bit virtual address, then it would be 21 divided by 10, which is 2.1, so we'd have to round up, and in that case we'd need three levels of page tables. All right, any questions about that?

All right, well then let's go to an example to actually test the TLB and see if it's working. In lecture 12 there is this test-TLB repository. It's a submodule, so if you want to get it yourself, you need to run that command. This is actually written by Linus Torvalds, the guy that wrote the Linux kernel. He wrote this little test so you can be sure the TLB works and is sane, which is kind of cool. So let's go look at that.

Basically it has a few arguments: the first argument is the number of bytes to allocate, and the second argument is the stride, or how many bytes to skip between memory accesses. In this case, it would allocate a single page and then access every fourth byte: byte zero, byte four, byte eight, byte twelve, and so on until it runs out of room on the page. It records how long all the accesses take and tells you approximately how many cycles a memory access costs. Because the TLB is essentially a cache, when I allocate a page, the first time it needs to translate that virtual address to a physical address it won't be in the cache, so it needs to go to the page table and actually resolve it and look it up. After that, it's going to be a cache hit every single time.

So, can anyone tell me what would happen if I do something like this? In this example, it would allocate 512 megabytes and then access every 4,096 bytes: the first access is byte zero, then byte 4,096, then byte 8,192, and so forth. What might I expect if I run this? How much longer? More than 10 seconds? Why would it take a lot longer? In terms of the TLB, how many times am I going to hit the cache in this case? The cache is keeping track of virtual page number to physical page number mappings, and how big is a page? Yeah: if every access is 4,096 bytes apart, are things going to be cached? No. I'm essentially accessing a new page every single time, so it needs to go through the page table and do the translation on every access, because nothing in this case is going to be cached. And if I run it, it takes a lot longer: instead of under two nanoseconds, it takes like 40, so this is significantly slower. The caches actually make a large difference to your program, because that's almost 40 times slower.
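Here's a minimal reconstruction of the kind of loop test_tlb times. This is my own sketch, not Torvalds's actual code; the timing and reporting are simplified:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch `size` bytes of freshly allocated memory at a given stride and
 * report the average time per access. A page-sized stride defeats the
 * TLB (every access lands on a new page); a small stride mostly hits it. */
int main(int argc, char **argv)
{
    size_t size   = argc > 1 ? strtoull(argv[1], NULL, 0) : 1 << 12;
    size_t stride = argc > 2 ? strtoull(argv[2], NULL, 0) : 4;

    volatile char *buf = malloc(size);
    if (!buf)
        return 1;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    size_t accesses = 0;
    for (size_t i = 0; i < size; i += stride, accesses++)
        buf[i];                          /* the measured memory access */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per access\n", ns / accesses);
    return 0;
}
```

A real measurement needs warm-up passes and cycle counters, and because the buffer is freshly malloc'd, lazily allocated pages fault in on first touch, which the real tool has to account for.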
So this is why, if you care about performance, you should try to fit as much information as close together as possible, to take advantage of caching by accessing things more or less sequentially. So if I do something like this, how fast should it be? I'm going to allocate, what is it, 16 megabytes, I think, and access every 128 bytes. Yeah: quicker than the last one, but not as fast as the first one. In this case it's going to be somewhere in between, because I'm going to get a bunch of cache hits but have to resolve a translation every so often. It comes out to something like 11 nanoseconds. Because everything's on a page, you get 32 accesses on the same page: the first one is a miss, and then you get 31 hits in a row, then a miss, 31 hits, miss, 31 hits. In the first case, I had one miss and then 1,023 hits; in this case, it's one miss and then 31 hits, repeating; and in the last case, the really, really slow one, they were all misses, because it has to go through all the page tables every single time. And the first access is always going to miss, because your caches start cold; there's nothing in your cache initially. Okay, any questions on that? It's kind of cool.

Okay, so for user space allocations, we saw this sbrk call, and that's all the kernel cares about for allocation. This system call just grows or shrinks your heap as you want, and your stack, of course, just has a set limit, like eight megabytes. Whenever the heap needs to grow, all the kernel does is grab pages from the free list to fulfill the request. If you need 10 more pages or something like that, it grabs 10 more pages, maps them in the page table, and sets the valid bit to true plus whatever permissions it needs, probably valid, read, and write. So it kicks the problem of smaller allocations off to user space to deal with. What we'll talk about later is how to actually do memory allocation, and if you are a memory allocator, this is what you have to deal with: you can only grow and shrink the heap. You can imagine that if you're a memory allocator and you actually want to return memory to the kernel, it's going to be really difficult, because to shrink the heap, every address above the new limit has to be free and no longer accessible, so you can move the whole boundary down wholesale. So it's really, really bad: in practice the memory mostly stays claimed by the process, and the kernel can't free any of those pages. But one thing you can do in your programs, and what some memory allocators do, is use this mmap system call that we might see later. It allows you to map in large blocks of virtual memory, and it allows you to unmap them again, so you're not bound to just one heap that is one big giant chunk of memory.
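For a feel of the two interfaces, here's a small sketch using the POSIX calls; error handling is minimal, and sbrk is a nonstandard (though widely available) extension:

```c
#define _DEFAULT_SOURCE   /* expose sbrk and MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    /* Grow the heap by one page; the kernel maps in a fresh page. */
    void *old_brk = sbrk(4096);
    printf("heap grew at %p\n", old_brk);

    /* Alternatively, map an independent 1 MiB block of virtual memory
     * that can be handed back to the kernel with munmap. */
    size_t len = 1 << 20;
    void *block = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(block, 0, len);   /* touch it so pages actually get allocated */
    munmap(block, len);      /* unlike the heap, this frees the pages */
    return 0;
}
```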
One fun lab some places do is a memory allocation lab, where you actually write your own malloc implementation. That could be fun, but I don't think we have time for it, unfortunately. So, this is what the process's address space looks like again: at the bottom it has some text and data, maybe a guard page so that if your stack overflows you actually get some indication of the overflow instead of it just randomly writing over memory. Then your stack looks something like this, and it's set up by the kernel. In the case of C programs, whenever the kernel starts an executable, it initializes the stack with whatever arguments created it, which come from an execve call; it lays all of those out, then a return address, then some empty space, and then the C runtime takes over and does whatever it does. Then, like we saw before, there's the heap above that, and at least for some OSes you might have a trap frame and some trampoline addresses that we won't really get into. The trampoline is just a fixed address set by the kernel, so for some things you can actually access kernel data without making a system call.

And you have a guard page so you can detect whenever there's a stack overflow: the kernel puts in a guard page and makes it so you don't have any permissions to access it, so that instead of overwriting random memory when you overflow your stack, you generate a fault. If you're the kernel handling that fault and you know where the guard page is, you know it's a stack overflow. Again, a trap is just any time some special handler code runs: system calls, exceptions, or interrupts. Page faults are what allow the operating system to handle virtual memory. They're a type of exception for virtual memory accesses, generated if a translation can't be found or if a permission check fails. What we actually see in user space is called a segmentation fault, even though we know segments aren't really used anymore.
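You can play with this from user space. This sketch sets up its own no-permission page, the same mechanism a kernel guard page uses, and touching it raises SIGSEGV. Assumptions: a POSIX system, and of course a real guard page is placed by the kernel below the stack, not by the program itself:

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* One page that exists in the address space but permits nothing,
     * just like a guard page below the stack. */
    char *guard = mmap(NULL, 4096, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guard == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("about to touch the guard page at %p\n", (void *)guard);
    guard[0] = 1;   /* permission check fails -> page fault -> SIGSEGV */
    printf("never reached\n");
    return 0;
}
```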
So, because of this, the operating system can handle things when a page fault gets generated. One option it has is to lazily allocate pages: it doesn't have to put entries into the page table and back them with real pages up front; it can do it on demand, whenever there's a page fault. That's why some of your programs can use a lot more memory than it looks like you have on your machine. In one of those examples we had, I allocated more memory than I had on my virtual machine, and malloc returned fine, gave me an address, and said, hey, everything was fine. But when I actually tried to access those addresses, I ran out of physical memory, and it just killed my process without giving me any sort of error message. So that's one thing it can do; it might generate weird errors, and while it might make your life easier, for the most part it generally makes it harder. The OS can also implement copy-on-write, which we talked about before, or even swap memory to disk to give the illusion that you have way more memory than you actually have, and we'll see strategies for doing that in the next lectures.

At the core of it, all page tables do, no matter whether you have a single level or multiple levels, is translate virtual addresses to physical addresses. The MMU is the hardware that uses page tables, and depending on the design, it may just use a single large page table, which is going to be fairly wasteful: even for 32-bit machines the megabytes add up very quickly, and especially for a 64-bit machine, where page tables can be gigabytes, they're kind of a non-starter. So we use multi-level page tables to save space for sparse allocations. The kernel just allocates pages from a free list; its allocation and deallocation strategy is quite simple, just adding and removing from a linked list. And finally, we saw that we use the TLB, which essentially acts as a cache, to speed up our memory accesses.

So, any questions before we wrap up? Slightly early, nice. Yeah. Oh yeah, sorry, the question is about the alpha factor here. Alpha is essentially just the hit ratio, the proportion of cache hits, so in this case 80%, and one minus alpha is the miss ratio, 20%.

All right, please fill out that survey and feedback, and hopefully everything's going okay. I guess this is hopefully the low point in the class: you have to deal with all the labs and all the quizzes, so if there's any point in the class where you're the most peeved and most willing to give feedback, it's probably now. So this should hopefully be the worst time for feedback, in a way, but hopefully it's good. Just remember, I'm pulling for you.