All right, good afternoon. Today we're talking about multi-level page tables. They're like single-level page tables, except more fun. First off, since it's after midterm season and I'm brand new here, feedback would be much appreciated. A what? It's clickable; download the slides and click the link. That was apparently the shortest link Google would give me. It's posted on Discord too. There are separate sections to give feedback on the lectures and on the labs, since I know the labs are a bit rough: a lecture half, a lab half, and a free-form section. Okay. So today we're talking about page tables, multi-level page tables, and how they can save space for sparse allocations. As we saw last time, even with a 39-bit virtual address, a single-level page table means each process could need a gigabyte of page table, which is not great and not going to work. Let's check that. With a 39-bit virtual address and a 4096-byte page size, 12 bits go to the offset, which leaves 39 minus 12, so 27 bits, for the virtual page number. That means 2^27 entries, and you'd be given the size of an entry: each entry is eight bytes, which is 2^3. So the table is 2^27 times 2^3 equals 2^30 bytes, which is a gigabyte. A gigabyte per process if we have one big page table.
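The single-level size calculation can be written out as a quick Python sketch, using the lecture's numbers (39-bit virtual address, 4096-byte pages, 8-byte entries):

```python
# Single-level page table size for the lecture's example numbers.
VA_BITS = 39
PAGE_SIZE = 4096          # 2**12, so 12 offset bits
PTE_SIZE = 8              # bytes per page table entry on a 64-bit machine

offset_bits = PAGE_SIZE.bit_length() - 1      # 12
vpn_bits = VA_BITS - offset_bits              # 27 bits of virtual page number
table_bytes = (2 ** vpn_bits) * PTE_SIZE      # 2**27 entries * 2**3 bytes each

assert offset_bits == 12
assert vpn_bits == 27
assert table_bytes == 2 ** 30                 # 1 GiB per process
```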
Okay, I really wish this worked. So here's what we're going to do to save space. The trick is to make each smaller page table fit exactly on a page. Then an L2 page table just points to an L1 page table, which points to an L0 page table, and we translate the address from there by replacing the virtual page number with the physical page number from the L0 entry, which gives us the physical address for that virtual address. We'll walk through it. Yep, sorry? Yeah, EXT. EXT is just "extension": bits we don't use, because the field has space for a 64-bit address and this scheme only uses 39. They're reserved for later use, so if you want, you can add a fourth level of page tables and support larger virtual addresses. Yep. And sparse allocation: with a normal single-level page table you have an entry for everything, whether it's used or not. Sparse allocation means a process only uses a few pages at a time, so if you only touch one or two pages, you don't need a massive table that's mostly empty. Sparse just means there aren't that many allocations. For RISC-V, you want each level to occupy exactly one page. If we know the size of a page and the size of a page table entry, we can figure out how many entries fit on a single page. In this case the page size is 4096, which is 2^12, and each entry is eight bytes, 2^3, so we can fit 2^9 entries on each page. Make sense for everyone? Then a page table entry at level Ln just points to the next lower level page table, Ln minus one, until you eventually get to L0, and the L0 entry is your translation. We'll see an example of that: you just keep following the levels down until L0, and that's your translation.
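The entries-per-page arithmetic is short enough to check directly; this sketch uses the same 4096-byte page and 8-byte entry:

```python
# How many 8-byte page table entries fit on one 4096-byte page.
PAGE_SIZE = 4096
PTE_SIZE = 8

entries_per_page = PAGE_SIZE // PTE_SIZE         # 2**12 / 2**3 = 2**9
index_bits = entries_per_page.bit_length() - 1   # bits needed to index a table

assert entries_per_page == 512
assert index_bits == 9   # each level consumes 9 bits of the virtual address
```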
Yep, that will just be told to you: on a 64-bit machine an entry is eight bytes. We'll also see an example on a 32-bit machine, where entries are four bytes, so you can fit more of them. Okay, so here's how you do a translation with just one additional level. We'll consider one virtual address, 0x3FFFF008. If you write it out in binary and separate it, the top nine bits are the L1 index, 111111111; the next nine bits are the L0 index, also 111111111; and the last 12 bits are the offset, 000000001000. This is a 30-bit virtual address, so we're using two levels of page table. If this were a single-level page table supporting a 30-bit virtual address, the table would be two mebibytes: 30 minus 12 gives 18 index bits, so 2^18 entries, and each entry is 2^3 bytes, so the whole thing is 2^21 bytes, two mebibytes. Everyone good with that? So, back to the previous lecture but with only 30 bits: a single-level page table would cost two mebibytes just to map a single page, because you'd need the whole table. With multiple levels, you only need one page for the L1 table and one page for the L0 table, so eight kibibytes total for the page tables instead of two mebibytes. In the worst case, though, this scheme actually consumes more memory than a single page table: if you had to record every single entry, you'd have an L1 page table plus 512 full L0 page tables, which is the whole two mebibytes plus the L1 table on top. Let's see that quickly. Again, we're considering a 30-bit virtual address, and we're given a page size of four kibibytes.
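Slicing 0x3FFFF008 into its three pieces is just shifts and masks; here is the split, with the layout [9-bit L1 index | 9-bit L0 index | 12-bit offset]:

```python
# Splitting the 30-bit virtual address from the example into its fields.
va = 0x3FFFF008

offset = va & 0xFFF            # low 12 bits, never translated
l0_index = (va >> 12) & 0x1FF  # next 9 bits
l1_index = (va >> 21) & 0x1FF  # top 9 bits

assert offset == 0x008
assert l0_index == 511         # 0b111111111
assert l1_index == 511         # 0b111111111
```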
So, for a single page table we'd need two mebibytes. The number of virtual pages is 2^30 possible addresses divided by the 2^12 page size, which is 2^18 entries, one per virtual page we need to track, and each page table entry is eight bytes, so 2^18 times 2^3 is two mebibytes. I write MiB because it's mebibytes: whether "megabyte" means a power of two or a power of a thousand is up to whoever is marketing the thing, but if you read MiB, you know it's the power of two. So we have that big address, 0x3FFFF008, and I'll write the offset in hex: 0x008 is the 12 lower bits, the offset we don't have to translate. The rest are index bits, nine per level, so the L1 index here is 111111111, which is 511. Remember, when you're translating these addresses you start from a root page table, the highest level, so your process would initially point at an L1 page table. For this to translate, let's assume the root page table is at address 0x7000; it has to be aligned to a page. The page table itself has 2^9 entries, indexed zero all the way to 511, and since this is a page table, one of the things each entry holds is a physical page number. For this example, say the entry at index 511 holds physical page number 8. If it's 8, then what else would be in the page table entry?
Yeah, there are some flags. One thing the hardware needs in order to follow the page tables is a valid bit, so there'd be a valid bit here saying: hey, this entry is valid. Since everything is aligned to a page, and the L1 index was 511, the hardware looks up entry 511 in the L1 page table and sees that it points to physical page number 8. So we go look up another page table, and that table starts at address 0x8000, because it's page-aligned. This is our L0 page table, and it again has the same number of entries. Say its entry at index 511 holds physical page number 0xCAFE and is valid. Since this is an L0 table, and the L0 index from our address was also 511, that entry is the final translation. So what would the physical address be that we're translating this virtual address to? Yeah: the offset stays the same, because it's within a page. We followed the tables all the way down to L0, so we just plug in the physical page number, 0xCAFE, and keep the offset, 0x008, giving physical address 0xCAFE008. Yep. Yeah, so the question is: how does that save space? Each of these smaller page tables fits on a page, so this setup takes up two pages, whereas a single big page table would be two mebibytes. This is two levels, nine index bits here and nine bits here, so if we just had one big table, our index would be 18 bits, right?
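The whole two-level walk can be simulated in a few lines. This is a toy sketch, not real hardware: the tables are Python dicts keyed by index, with the made-up contents from the example (L1 entry 511 holds PPN 8, the L0 table at 0x8000 holds PPN 0xCAFE at entry 511):

```python
PAGE_SIZE = 4096

# Hypothetical page tables from the example: index -> (ppn, valid).
l1_table = {511: (0x8, True)}             # entry 511 points at the L0 table
l0_table = {511: (0xCAFE, True)}          # entry 511 is the final translation
tables_at = {0x8 * PAGE_SIZE: l0_table}   # the L0 table lives at 0x8000

def translate(va):
    offset = va & 0xFFF
    ppn, valid = l1_table[(va >> 21) & 0x1FF]   # walk level 1
    assert valid
    next_table = tables_at[ppn * PAGE_SIZE]     # page-aligned, so ppn * 4096
    ppn, valid = next_table[(va >> 12) & 0x1FF] # walk level 0
    assert valid
    return (ppn << 12) | offset                 # swap VPN for PPN, keep offset

assert translate(0x3FFFF008) == 0xCAFE008
assert translate(0x3FFFF009) == 0xCAFE009  # same entries, different offset
```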
So with 18 index bits we'd have 2^18 entries times 2^3 bytes each; inside every page table entry is the PPN and the flags. That's two mebibytes. Yep. And since it's a physical page number, you assume everything is page-aligned, so the low three hex digits of the table's address are 000; that's why an entry holding 8 means physical page number 8, at address 0x8000. When we translate addresses, we never touch the offset. As many as you need, yeah. Yep: this is why it saves space, because you only have as many L0 tables as you actually need. An entry corresponds to the next level of page table, and you don't know in advance which index in that table you'll access, so you assume the table is aligned to a page and use the index from the virtual address, which could land anywhere on that page; it doesn't correspond to the 008 offset. Because every table is aligned to a page and fits on a page, you know where to access everything: if the index is zero, it's the first entry; if the index is one, it's eight bytes in; and so on and so forth. Yeah, the L0 table has to be allocated somewhere. You're the operating system, you can do what you want. Yeah, physical page number: when the entry points to a lower level, it tells you which physical page that lower-level page table is on. And 511 is the index from the virtual address, so when we had, where are we?
So when we had our virtual address, 0x3FFFF008, and we divide it up, go back one: this diagram only goes up to L1, so this is the number of index bits for L1 and this is the number of index bits for L0. That's where they came from. Yeah, the hardware has to know how many levels are set up; those are different modes. In this case it would be the 39-bit mode, which knows it needs three levels. A 30-bit mode, which the hardware doesn't actually support, would know to follow only two levels of page tables, and you'd have to set it up. Yeah, yes: 0x8000 is the physical address where that L0 page table is. The page tables only ever contain physical addresses. All the L1 physical page numbers still refer to physical memory, but it's the physical memory where another page table lives, and it's all physical because the kernel is managing all the physical memory. All of them tell you exactly where the physical address is. Well, it's 512 entries that can each point to another page table, which can each point to another 512, so it's exponential, right?
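That exponential fan-out is easy to quantify. A sketch with the lecture's constants (4096-byte pages, 512 entries per table): each extra level multiplies the reachable address space by 512.

```python
# Fan-out: how much virtual address space each configuration can map.
PAGE_SIZE, ENTRIES = 4096, 512

two_levels = ENTRIES * ENTRIES * PAGE_SIZE   # the 30-bit, two-level example
three_levels = ENTRIES ** 3 * PAGE_SIZE      # three levels, as in RISC-V Sv39

assert two_levels == 2 ** 30     # 1 GiB: 9 + 9 + 12 = 30 address bits
assert three_levels == 2 ** 39   # 512 GiB: 9 + 9 + 9 + 12 = 39 address bits
```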
It's like a big tree, so you can support the same number of addresses. Yeah, in the worst case you're actually using more memory: if you need an entry for every single page, all of your L0 page tables are full, so you'd have 512 of them, which is two mebibytes by itself, plus one L1 page table pointing at all of them. So you waste an additional page when everything's full, and it gets even worse with three levels, where you'd have 512 times 512 L0 tables. But unless you're Visual Studio Code, you don't use gigabytes of memory. Yeah, the MMU knows what mode it's in, and you as the kernel have to make sure your page table points to a lower-level one, and that one to a lower-level one, and it knows it only goes three deep in that case. Yeah, the offset at the end, for which one, sorry? This 000 here? It comes from the fact that all the page tables are aligned to a page: 12 bits of zero, which is three zeros in hex. They're all page-aligned, because otherwise it's like, well, where do I start? How do I access whatever index it's at?
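The worst-case overhead works out like this; a short check of the numbers just discussed:

```python
# Worst case: every entry in use, so every L0 table must exist.
PAGE_SIZE, ENTRIES, PTE_SIZE = 4096, 512, 8

l0_tables = ENTRIES                  # one L0 table per L1 entry
l0_bytes = l0_tables * PAGE_SIZE     # all 512 L0 tables, each one page
l1_bytes = PAGE_SIZE                 # plus the single L1 table

flat_table_bytes = (2 ** 18) * PTE_SIZE  # the single-level alternative

assert l0_bytes == flat_table_bytes == 2 * 2 ** 20   # the same 2 MiB
assert l0_bytes + l1_bytes == 2 * 2 ** 20 + 4096     # plus one wasted page
```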
It just becomes a giant mess. If everything fits on a page and is aligned to a page, then as the kernel you only have to deal in pages, and everything past that is fairly easy. And once I have this translation set up, say I access the next thing over, with offset 0x009: it does the exact same translation, goes through all of it, and the physical address at the end is 0xCAFE009. I don't have to create new page table entries or anything like that; it's all the same entry. Yeah, the kernel will try to minimize the number of tables there are. That's why it gives you virtual addresses sequentially: your heap grows from one end, so allocations end up beside each other in these page tables, keeping them as small as possible. Yeah, and there's only one root: with two levels of page table you have an L1 and an L0, and only one L1, and that root pointer is what changes when you switch processes. If you have three levels, an L2, an L1, and an L0, you have one L2, and that's the pointer that changes. No, it changes; that's your root page table. Oh, I got it, yeah: SATP is the RISC-V name for the root page table register. It stands for Supervisor Address Translation and Protection. It's a very good name. Okay, so we did that. That was our example: one L1 page table whose entry at index 511 holds physical page number 8, which, since all page tables are page-aligned, means there's an L0 page table located at 0x8000. Hopefully it has an entry at index 511 with 0xCAFE, marked valid, in which case the physical address is 0xCAFE008. Mm-hmm. Yeah, if it's not valid, it's a page fault, and in your user process you'll get a segfault, though we all know the name "segfault" is silly because segmentation doesn't exist anymore. Yeah, in this case the physical page number field is as wide as the entry allows; with a 56-bit physical address there are 44 bits in there, but all the leading ones are zero. And if you're the kernel, you want to make sure you don't map the same physical page to two or more processes. The virtual addresses can be the same; the physical pages shouldn't be. So you have to keep track, and well, we'll go into that: when the kernel allocates memory, it allocates in terms of pages, so it maps a page and marks it used, and doesn't give it to anything else. We'll talk about that, yeah, that's next, I think. So processes use a register like SATP, that acronym again, to set the root page table for the process, and after that the MMU uses that page table to do the virtual-to-physical address translation. And there's a good comment here: isn't this slow? I have to access memory in order to access memory, to maybe access memory again, before I finally get to the physical memory where my data actually is. That's a good comment. It is slow, but first let's talk about page allocation. Since the kernel only allocates in units of a page, it's actually really easy: you just keep a free list of pages. When the kernel boots, it knows how much physical memory there is, so it divides physical memory into pages and maintains a free list: on every free page it writes a pointer to the next one, as part of initialization. It only needs to write one pointer every four kibibytes, so it can do that fairly fast, and then you have a linked list of your entire free memory. These pages are all unused, so each one just holds a pointer to the next free one, and that happens at boot. Then whenever you need to grab a page, you just look at the first page in your free list, you map it, and it's not free anymore. Yep. Yeah, memory allocators like malloc are what actually care about different sizes; the kernel doesn't. malloc deals with that. malloc only allocates heap: it makes the heap bigger, the kernel backs the heap with pages, and malloc figures out how to give you your eight bytes or whatever you need. That's all user space. A fun lab you could do in this course is writing your own malloc, because the kernel doesn't really care: it gives you heap and says, you manage the heap. Yep, there's a system call to increase the size of your heap, but that's it, and it only works in terms of pages. Sorry?
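The free-list scheme just described can be sketched in a few lines. This is a pure-Python toy, and the page count is made up; a real kernel stores the "next" pointer inside each free page itself rather than in a separate list:

```python
PAGE_SIZE = 4096

class PageAllocator:
    """Toy free list of physical page numbers."""

    def __init__(self, total_bytes):
        # At boot: carve physical memory into pages, all initially free.
        self.free = list(range(total_bytes // PAGE_SIZE))

    def alloc_page(self):
        # Pop the head of the free list: O(1), no searching.
        return self.free.pop()

    def free_page(self, ppn):
        # Mark the entry invalid in the page table (not modeled here),
        # then push the page back onto the free list: also O(1).
        self.free.append(ppn)

alloc = PageAllocator(16 * PAGE_SIZE)   # pretend we have 16 pages of RAM
p = alloc.alloc_page()
assert len(alloc.free) == 15            # one page handed out
alloc.free_page(p)
assert len(alloc.free) == 16            # and given back
```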
You can decrease it, but it's going to be hard, because you need the bottom page of the heap to be free before you can shrink it. If there's still a live allocation in the middle or somewhere, you can't free those pages. Yep. Yeah, for what, why would you have to follow the chain? You don't: when the kernel is going to put a page into the page table, the free list has a pointer to the next free one, so it just picks that one. It pops the head of the list, says it's not there anymore, and fills in the entry. So if you need to back some memory, you eventually get down to L0 and need a page to correspond to that entry; you grab it from the head of the free list, which is really quick, populate the entry in your L0 page table, and you're done. Deallocating is also really easy if you're the kernel, because you know what the entry maps to: you just say, hey, you're not valid anymore, so it doesn't translate anymore, and you add the page back to the free list. You get your page back. So allocating and deallocating pages is really, really easy; the kernel can do that pretty easily. User-space allocation is a lot harder, and we'll talk about it later, but the kernel gets it easy: it just cares about pages. Now here's where, yeah, using page tables for every memory access gets really slow, because we have to follow pointers across multiple levels of page tables. And you likely use the same page multiple times: think of your C program. Your stack is on a page, so if you use one element on your stack, you're probably going to use the next element, and the one after that. So why would you bother translating the same thing that translates to the same
physical page number a thousand times when you know you're going to use it over and over again? Also, some processes might only need a few virtual page number to physical page number mappings. So we just pull out the classic solution and throw a cache on it. The cache for your MMU is called a TLB, or translation lookaside buffer. The image kind of sucks, but all it shows is: there's a logical address up here, which is the same as virtual, and then there's a cache. It checks whether the virtual page number is in the cache, which is that ugly thing in the middle, mapping page numbers to frame numbers; frame number is just physical page number again, this is where the terminology gets weird between people. On a hit, it just gives you that address back, you have your physical address, and you can access the memory. On a miss, you have to do the page table walk: with three levels of page tables you go through L2, L1, L0, and then do the memory access. Remember, one of our goals was to make things really quick and make virtual memory appear just as fast as physical memory. Since there's a cache, we can calculate an effective access time. In this case assume a single-level page table, so a miss costs only one additional memory access. The TLB hit time: it takes a small amount of time to check whether the entry is actually in the cache, so the time for a memory access on a hit is the time to search the cache plus the time of the memory access itself. On a miss, we still searched the cache, and then we need two memory accesses: one to look up the page table entry and find the physical page, and one for the original access. So if I
had three levels of page table, what would that be? Yeah, four: with three levels of page tables there would be four memory accesses, one for each level of page table plus your actual access. It gets really bad the more levels you have, which is why on a 64-bit machine we use a 39-bit virtual address: it's good enough, and adding another level would mean another access and make everything slower. Yeah, currently, but you can switch the mode. 39-bit is the default mode on RISC-V processors; 48-bit is also supported, and you're not going to get more than that much memory right now. Okay, and then to calculate your effective access time, or EAT: alpha is the proportion of cache hits, so EAT is alpha times the time for a hit, plus one minus alpha, your miss ratio, times the time for a miss. If you assume alpha is 80% hits, your search time is 10 nanoseconds, and a memory access is 100 nanoseconds, then the effective access time is 0.8 times 110, the search plus one memory access, plus 0.2 times 210, since 20% of the time we pay the search plus two memory accesses. That comes to 130 nanoseconds, about 30% slower than if we just used physical memory. Which isn't great, but it's way better than virtual memory without this cache, where every access would be 200 nanoseconds, and realistically, with a three-level page table, 400 nanoseconds, four times as slow. So instead of being four times as slow, we're about 30% slower. Now, there is a problem with this cache whenever we're switching processes. If we switch processes and don't do anything with the cache, and the new process that got context switched in happens to use the same virtual address, it would be a
cache hit, and it would map to the old process's memory, and the new process could monkey around with that, and bad things could happen. The simplest thing you can do is invalidate the cache whenever you context switch, so translations have to be recomputed for the new process. That works, but it's a bit slow. In some implementations, if the hardware developers are nice to you, you can tag TLB entries with a process ID to make them unique: the process ID is stored alongside the virtual address, so when you context switch in and the new process happens to use the same virtual address, the lookup carries a new process ID and doesn't get the last process's result. That's in some hardware, but not really for RISC-V here: you have to explicitly flush the TLB with that sfence.vma instruction. On x86, changing the root page table pointer automatically flushes the cache for you, so you don't have to worry about it. Another question: how many levels of page tables do I need? It depends on the size of your virtual address. Here's an example from the pre-2000 era, when we had 32-bit processors. If we have a 32-bit virtual address and a page size of 4096: since we're on a 32-bit machine, a page table entry is only four bytes now, because addresses are smaller, so we actually have more room for entries on a single page. Again we want all of our page tables to fit exactly on a page, so in this case we can fit 2^10 page table entries on a page: each page is 2^12 bytes and each entry is now 2^2 instead of 2^3. The number of bits you need to index into a page table is just the log base two of the number of page table entries per page, which is ten. So the number of levels we need is the ceiling, rounding up, of the number of virtual address bits minus the offset bits, divided by the index bits per level. So in
this case it would be 32 minus 12, because that's our page size, which is 20, divided by the number of index bits we need per page table, which is ten. 20 divided by ten is two, a nice even number, no rounding up needed, so we need two levels. But if it were a 33-bit virtual address, we'd have 21 divided by ten, which is 2.1, so we'd round up and need a third level of page tables, which of course would make everything slower. Okay, so to see that the TLB is actually a thing, there's this fun little tool written by Linus Torvalds himself. He wrote it, so it's definitely correct. It takes two arguments: it does an allocation of size size, then accesses that memory every stride bytes. If the stride is four, it accesses every four bytes: byte zero, four, eight, twelve, sixteen, and so on. So let's go ahead and see that. Knowing what we know about the TLB, whoops, if we do this, we allocate ourselves a page and then access every four bytes. This should be relatively fast. Why would this be relatively fast? Yeah: it's all in the same page, and that page's translation gets cached. We allocated 4096 bytes and accessed every four, so we accessed memory 1024 times. Since they're all on the same page, the first access wouldn't be in the cache, so we'd have to walk the page tables, but after that, since everything's on the same page, they'd all be cache hits, so it's fast. On the flip side, if I allocate 512 mebibytes and then access memory every page, this should be really, really slow. Yeah, it depends on the malloc implementation; in that bank example I had, I allocated a huge amount of memory and it said no problem, and then crashed when I tried to access it. Uh, no, I
actually have enough RAM to do this one; the other one just kind of lied, and if I increased this a lot more it would crash the same way. So if I run this, there are essentially no cache hits: I'm accessing every page, so every single access has to walk all the page tables and look it up, and it is dog slow. It's 40 nanoseconds per access instead of under two. So this has a big effect, and it's also why you want to keep things as close together as possible. If we do this one, allocating 16 mebibytes and accessing every 128 bytes, it should be somewhere in between. If we run that, we get 11 nanoseconds, which isn't great, and that's because within each page we're doing 32 accesses: the first access is a miss, then we get 31 hits in a row, then miss, 31 hits, miss, 31 hits. It depends on the size; the total number of accesses is just size divided by stride. I don't think the runs are all the same; that number is just how many measurements it takes, and it reports the average, so more measurements give a more consistent number, but about the same value. Yeah, it's taking the average. No, this is the average time per memory access; 40 nanoseconds total would be really, really fast. Okay, cycles: the smallest one was 6.6 cycles, and this was on my machine, so I ran it again and it was a bit different, but you can see it makes a vast difference. So this is where, you know, the kernel only cares about pages: that sbrk system call just makes the heap bigger. It grows or shrinks the heap, and that's it. Your stack has a set limit, something like eight mebibytes. For growing, the kernel just grabs pages for you from the free list to fulfill it; the kernel sets the page table entry valid and does the mapping for you. And for memory allocators this interface sucks, because you
generally can never free memory again: you'd need the last allocated page to be completely unused before handing it back, and in general that's not going to happen, so it just stays claimed by the process and wastes memory. But if you're a memory allocator, a fun other system call you can use is mmap. It maps in large blocks of memory at a time, and since they're backed by pages, when you know you're not using a page anymore, you can go ahead and free it. sbrk itself just tells you the new address of the end of your heap. Yeah, we don't do a lab on it, but if you call sbrk with zero it tells you where the current break is, and you can move it with an argument, so you know how big the heap is. If you're malloc, you're the one making those calls, and you manage the heap. And yeah, the kernel initializes all of that. We saw this before; I'll wrap up quickly. It also puts a guard page at the end of your stack that doesn't map to anything, marked not valid, so instead of running through your stack and changing some random data, you generate a page fault and the program gets killed, rather than doing wacky stuff. That's the fun stuff you can do when you control the pages. Yeah, we don't cover the trampoline; that's for more advanced courses, you don't have to worry about it yet. The kernel can also do more fun stuff with virtual addresses: it can eliminate some system calls just by mapping memory. clock_gettime, for example, doesn't do a system call; the kernel maps in virtual memory that lets you read that value directly, so it's nice and fast, especially when you have a nanosecond timer. And yeah, page faults let the operating system do all that copy-on-write stuff, lazily allocate pages, which is why it lets you allocate more memory than you
have and then crash whenever you try to use it, or do fun stuff like swapping, which we'll see later. So, we saw that page tables translate addresses; a single large table is wasteful; multiple levels save space for sparse allocations but are slower; the kernel allocates pages from a free list; and the TLB speeds up that slow translation process. And, yeah, that's pretty much it. If you want, let me know and I can do an example program where we set up page tables ourselves, because sometimes this is confusing. All right, with that, have a good one, and I'm pulling for you.