All right, welcome back to 353. Thank you for joining me today. So does anyone know where everyone else went? Is it too early, or did virtual memory scare people off? Yeah, ran out. All right, or 11 is too early. I guess this is the first class and it's recorded, so screw it. Oh yeah, guessed as much. All right, so where we left off last time: talking about virtual memory. Processes each get their own set of page tables. The root page table is stored in a register, like SATP, and the kernel manages it. And we discovered that, hey, this is how the kernel and the MMU actually work. They take the virtual address, they take the virtual page number, and they divide it up into three sections. So in this case, we have a three-level page table, where each smaller page table fits exactly on a page. Because a page table entry takes up eight bytes, that means we can fit 512 of them on a page, so we need nine bits to index it. So we have three levels here, plus our offset, which is where we are in a page. We'll always assume, unless told otherwise, that our page size is 4,096. And this is how it works. So using the root page table, which is a pointer to that L2 page table, well, we know where the L2 page table is. We know what entry to get from the index. We go get the page table entry. It tells us where the next L1 page table is. We get the index for that. Then that entry tells us where the L0 page table is. We get the index for that. We use the PPN from that to get the actual physical address for that virtual address, blah, blah, blah. So any questions about this if I need to clear this up at all? We'll get to talk about it again today. Yep. So a page table, it's basically just an array of page table entries. So it's just holding all the page table entries for us. Yeah, it's basically for mapping a virtual page to a physical page.
Yeah, so the page table entry will tell us where the page is in physical memory, and then whether we can read it, write to it, whether it's valid, some things like that. All right, so who has heard of the word alignment before? So this is a brief aside to talk about alignment, which is why we don't have to store the whole address of where a page table begins. Well, alignment just means addresses line up in multiples of some size, starting from byte 0. So if pages are 4,096-byte aligned, it means they always start at addresses where the lower 12 bits are all 0s. Or in other words, if you convert to decimal, they're all multiples of 4,096. So a page will start at byte 0, byte 4,096, byte 8,192, so on and so forth. I can't go higher than that off the top of my head. If we didn't have alignment, that means a page could just start anywhere in memory. So maybe it starts at like 7C00, which is weird. So the last byte would be at address 8BFF, which is just really weird. So as part of the page table entry, if a page could start anywhere, we would have to keep track of the whole address, because we have no idea where it could start. Well, if everything is aligned, we already know where it starts. We don't have to bother recording that information. So if it's an aligned page, all of the lower 12 bits, in this case, are all 0, or it's a multiple of 4,096. So we know where pages are going to start. If it's like page 7, it's going to start at 7000 and go all the way up to 7FFF. So using that byte-aligned idea, we might ask ourselves for other data types: is this address aligned? So is address EC8 8-byte aligned? Or in other words, is it an exact multiple of 8? No? Yeah? Actually, yes. So C in hex is 12, but that digit sits in the sixteens place, and 16 and everything above it is already a multiple of 8, so we don't even have to convert this whole number into decimal. We only have to look at the last digit, and 8 is a multiple of 8.
So it's only going to be 8-byte aligned if it ends with a 0 or an 8, no matter what the rest of the address is. So same thing for pages. Computers like things aligned like that, so they don't have to keep track of the lower bits. Yep, so 8-byte aligned means if you start at byte 0, that address needs to be 0, 8, 16 — the address has to be a multiple of 8. So we simulated the MMU thing yesterday. Any more questions about that or things we want to go over? Otherwise, we've got lots of other examples today and other topics to cover. So we all kind of understood this stuff we did yesterday, where we just showed that, hey, with the root page table, we can go ahead and create an L2. And in the L2, that entry points to an L1. In that L1, that entry points to an L0. And in the L0, we go ahead and we have our actual translation there. So more or less good with that, because you'll be playing with this in Lab 3, essentially. So we can ask questions now or gather up your questions, because at the end of the week I'll do a primer for the next lab, because it can be slightly confusing. OK, so let's see how much we actually know. So how many page tables do I need? Let's assume that our program actually uses 512 pages of physical memory in order to hold all the values it needs to. So there are some questions I could ask you about that. What is the minimum number of page tables this process needs if it's actually using 512 pages? And what is the maximum number of page tables it uses, in case it just uses really, really, really, really, really bad addresses for each value? Any guesses for the best case here of how many page tables we actually need? Yeah, just one. So which one page table would I need? So I have to have an L2. Then three. Yeah, three, three. So in this case, my best case is three page tables. So each page table can hold 512 entries.
So I could have one L2 that just has a single entry that points to an L1 that points to a single L0. And then in my L0, it is just full of entries. In this case, it has 512 entries and it's full. So I can point up to 512 pages. So each of these would point to a physical page. So if I had, if that program, oh, yep. So you always need an L2 because that's just how the MMU works. So it always starts with an L2. Yeah, you could do something where, technically, you could get away with one page table in a very weird case where you have L2 and the entry points to itself. So you could, if anyone like recursion, so you can do that. So it'll still follow the same rules. So in that L2 page table, oops, well, it would get the entry and then treat that as the L1 page table. But the L1 page table would be the same L2 page table I just used. And then it could happen again over and over again. So that's called a recursive page table. And yeah, it gets a bit messy. But you still always need an L2. It will always follow the same rules of doing three levels of page table, even if you point to the same page. Yep. So the reason we have three is because we want to support a 39-bit virtual address. So it needs to be like this and it has to have three. If we only want a 30-bit address, then we can get away with two levels of page tables. But we're just assuming a 39-bit address here. So we need three because those are the rules. All right, so let's extend this and make this a little bit harder. What about the process is using 513 pages? How many page tables could, at minimum, I need? Four, right? So this L0 page table is full. So in my best case, I would need another one. There would have to be another entry in this L2. And then this would have one entry here to a page and that would be 513. OK, so undo and let's go to the worst case. So in the worst case, how many page tables do I need? Yep, close. So the L2 page table, you only have exactly one of them. 
So yeah, 512 times 2 plus 1, essentially. So it's going to look like this. So worst case is, oh, that's the end of the page, yikes. So worst case is I'll always have a single L2. But I could have all 512 entries of this L2 be full. And each entry in L2 would point to an L1. And I would have essentially 512 of those. And then worst case, each L1 page table just has a single entry to an L0 page table. So I would also have 512 L0 page tables. So in total, I would have my one L2 plus my 512 L1s plus my 512 L0s. And then my worst case becomes 1,025. Yep, so an empty page table wouldn't do anything, right? This is why it's the worst case — empties are useless, but these each have one entry, so we actually require them. No, so when you fork, which you'll be doing in Lab 3, spoiler alert, you just have to blindly copy all the page tables. So if it has 1,025 page tables, you'll have to copy 1,025 page tables. Typically not ideal, right? This is also why we have the indexes laid out this way, right? Since the indexes — this is all in hardware — we could put L0 here, L1 here, and L2 here. But that would kind of mess up these page tables. If we just have normal contiguous addresses, then all the L0 indices would be right next to each other. So they'd all be on the same page. So we would fill them up in order, and we would likely get the best case scenario here instead of the worst case scenario, because if they're all contiguous addresses, we'll fill up the table in order. OK, so any further questions about that? Yeah, so in this case, if we add another page, we can't get another L2. We can only have one. And they're each pointing to an L1. So in this case, if I had 513 pages, there would just be two entries in one of the L1s, and the second entry would point to another L0. And my worst case just went up to 1,026 in this case. Any other questions?
Cool, all righty. Oops, not that. All right, so another style of question. If we want to go back into ye olden days, when we had 32-bit processors — anyone remember 32-bit processors, or are we too young? OK, yeah, we kind of do. Remember before? No, you don't know what before-internet is like. Jesus, OK. So let's do this question. Let's see what they had to do back in the day. So back in the day, we had a 32-bit virtual address. And since pages don't really change, our page size was still 4,096 bytes. But because it was back in the day, with only 32-bit addresses, our page table entry could be a lot smaller. So we could have a four-byte page table entry. So if I want to go ahead and write these in powers of two, just so I don't forget, this is 2 to the 12, and this is just 2 to the 2. So for this, we might be asked a question like, how many levels of page tables do we need? So in order to calculate that, the first step is to figure out, assuming that we want to fit all of these entries on a page, how many bits we need to actually index it. How many entries can we fit on a single page? So how many entries can we fit on a single page here? So it should be the page size — you can keep it in powers of two to make it easy. So the page size, which is 2 to the 12, divided by the size of a page table entry, which is 2 to the 2. And that should equal 2 to the 12 minus 2, because you can do that with exponents, which is 2 to the 10. So in total, we can fit 1,024 entries per page, right? So given that, how many bits do we need to index any entry on this page table? Yeah, 10, right? So it's just log 2 of 2 to the 10. So 10 bits for the index. So if you want to figure out the number of levels you need, well, it is just the virtual bits minus whatever bits we need for the offset. So that will be the total number of bits for the virtual page number. And then we're just dividing it up into different indices.
So, divided by the index bits. And we would have to round up, so we would take the ceiling of that if it didn't happen to be a nice power of 2. But in this case, let's see. Our virtual bits are 32. Our offset bits — that's where we are in a page — so it's just log 2 of the page size, so that is 12. And here we have 10 index bits. So you'll see these numbers are nice — they were designed to be nice. This is essentially where we got our page size from. So it would be 20 divided by 10, which is thankfully a nice number. So it's exactly 2. So the ceiling of 2 is still 2. So in this case, we need two levels back in the day. Yep. So the offset bits are like what byte we want to use in a page. So it's just our page size in bytes, and log 2 of that to figure out what byte we want on a page. Because, again, memory is byte addressable. All right. Questions about that? So back in the day, we only needed two levels on 32-bit machines. And I guess we'll see why this is actually slow. We kind of argued that it'd probably be slow last class. Let's see why it's slow. But any questions about this before I move on? Yep. So why not use 32-bit computers? So 32-bit computers, if you had a 32-bit address, how much memory can you use? Yeah, so it's the upper cap. So if you only have a 32-bit address, you can only use what? That's 4 gigs of memory, right? Can you use your web browser with only 4 gigs of memory? So since we went up to a 64-bit machine, we had to increase the size of our page table entry. So when we moved up, that was 10 bits before for an index. Now it's 9. So if I wanted the same thing where I only have two levels of page tables, I could only support a 30-bit virtual address on a 64-bit machine, which would only be a gig, which would be no bueno for most things. Yeah.
So like 32-bit machine, 64-bit machine, that's the size of an address on that machine. So yeah, in this case, well, any address virtual or physical is going to be 64 bits or 8 bytes. And right now, for virtual addresses, it only uses the lower 39 bits as actual addresses, and the rest of them get thrown out. And then for the page table entry, well, it can store some more information. So it supports up to 56 bits of physical memory, which is a very large number, and it throws out the upper ones. It's just a cap overall on the address size, no matter what. So at most, we have 32-bit, on a 32-bit machine, you have at most 32 bits for a virtual address or a physical address. Technically virtual. On a 32-bit machine, that's like the max. So if this is a 64-bit machine, the most we have is 64 bits for an address, virtual or physical. Yeah, yeah. So yeah, so why we move to 64-bit machines is just because we needed to be able to address all the memory we could physically use. But you can see once we moved up to 64-bit, we added another level page table, and it actually just made things slower. So the bigger numbers make things slower. So you might find now, you'll find, especially when people first started moving to 64-bit, things got slower. So it's just like, hey, why the hell am I using the 64-bit machine, the 64-bit CPU, this sucks, I only have two gigs of RAM and everything's just slower now, what the hell are you doing? And it took the hardware a bit to catch up until people were like, just use 64-bit and didn't care. Yeah, 64-bit is a gigantic number, they don't run into the cap. You will never run into that. No, so usually processes only use virtual memory, right? So on a supercomputer, maybe it uses more than 512 gigs of memory in a process, in which case that'd be larger than 39-bits. So to solve that problem, all we would do is just add another level page table. 
So on the CPU, there's hardware support for four levels of page tables, so it could support up to, what, a 48-bit virtual address. But your drawback is you have another level of page table — it's slower. So you kind of have to pay for what you use. 39 bits covers most people's processes, so that's what we use. All right, oh yeah, hey, there's a slide for this too. So using page tables for every memory access is really, really, really slow, because I have to follow the pointers across multiple levels of page tables. And likely we access the same page multiple times — close to the first time we access a page, we'll probably access that same page again. Maybe a process only uses a few pages of memory at a time, if we're lucky. In which case — whoa, where's my mouse pointer? Yeah, in which case, if a process only uses a few pages of memory at a time, well, it only needs a few virtual page number to physical page number mappings at a time. So in order to speed this up and not have to walk the page tables for every memory access, we apply the classic computer science solution to the problem: we just stick a cache in front of it. So the translation lookaside buffer, or the TLB — that is a term you will definitely need to know in this course — is basically just a cache for page table entries. So if we have our virtual address and we just split it up, we don't care about the individual indices, because we just care about the virtual page number to PPN mapping. What the TLB is, is basically just a cache for virtual page number translations. So if a process accesses some virtual address, it'll first look it up to see if it's already translated that address before. If it's already been translated, it would be present in this table — that's called a TLB hit — and then we just go ahead and use that cached entry to translate the address for us. We don't have to go walk all the page tables; that would be slow.
Only in the case where it's not in the cache do we get a TLB miss. So we haven't seen that translation before, or it's been invalidated in the cache. So if we miss, then we have to translate that address using the page tables. So that would be L2 to L1 to L0 to finally get the translation, and when it finally gets the translation, it adds that to the TLB and uses it to translate the address. So using this — remember, one of our goals was to get access time as close to physical memory as possible. So now, it's not that much of a formula, but we can calculate the effective access time. So assume here that there's only one level of page table, so there's only one additional memory access for the page table. So my TLB hit time — the time it takes to translate an address that is in the TLB — would be equal to how long it took to actually see if it's in the cache, plus the memory access to actually go ahead and access that memory. If it's not in the cache, well, we still had to search the cache, or the TLB. And then in this case, it was two times memory: with one level of page table, I had to do a memory access to get the page table entry, and that told me where the page was in physical memory, and then I could go ahead and do the original memory access. So if I had three levels of page tables, this number would be a four: I would have to access L2, then L1, then L0, and then do the original access. So this coefficient here depends on how many levels of page tables we have. So if we want to calculate the effective access time, it's alpha, our hit ratio — the percentage of hits — times the time it took to actually resolve that address on a hit, plus one minus that, our miss ratio, times the miss time. So if I had a hit ratio of, say, 80%, and the TLB search took 10 nanoseconds, and normal memory accesses took 100 nanoseconds, we can go ahead and calculate the effective access time, or average access time.
So it would be 0.8 times 110 — 10 to look up the entry in the cache, and it's there, and then 100 for that memory access — plus 20% of the time I had to look it up, so 10 nanoseconds, plus I had to access memory for the page table, then I had to do the original memory access, so the miss case would be 210 nanoseconds. So my effective memory access time in this case is 130, so it's 30% slower than just using physical memory directly. Yeah, so it's the number of levels plus one. So you need a memory access for each level and then the original memory access. No, so this is just to translate one address. So it will always go access L2, then L1, then L0, then we've got it, and then we have to do the original memory access. Yep, there's no weird questions in this course. Sorry, alpha one — so 100% hits. So if it was 100% hits, then it'd just be 110. No, so for this, we'll have an example. So a 100% hit ratio means it's already in the cache, but the cache can't just magically know what the translations are, right? So it could be the case where it's almost 99%, right? If we just access the same memory location over and over and over again, it would need to do the full translation for the original access, but then for every access after that, we already know the translation. So the cache is just remembering the outcome of doing the translation before — how they implement it is a hardware course thing — it'll just remember the value. And there's like different cache architectures and all that. Yeah, so how much TLB cache memory do we have? So there's different caches, right? So there's caches for memory, and then the TLB is a specific cache that just holds the translations for virtual memory. I'm unsure how large it is. It depends on the CPU; you'd have to look it up. I don't exactly know how large it would be.
Yeah, so, because of that — the cache, again, is just remembering the values we calculated before, so we don't have to walk the page tables every time — remember each process is going to have its own page tables, so its own virtual memory. So the same virtual address in two processes might actually map to different physical addresses. So you also have to handle the TLB when you do context switches. The easiest option is to just flush the cache. So whenever we switch to a different process, all of those translations for virtual memory are not valid for the new process. So we just flush the cache, or in other words, delete all the current values in it, because they don't apply to this process. We don't want those entries to stick around; otherwise, you could accidentally use another process's memory. So most implementations, that's what they do: they just flush the TLB. RISC-V has a special instruction for it, called sfence.vma, and that will flush the TLB. So if you're looking at that RISC-V operating system, you'll see that instruction. On x86, whenever you use the instruction to change the root page table, it automatically flushes the TLB for you. So now — has anyone ever thought of, oh, yep. Yeah, flushing the cache just means deleting it, just getting rid of everything. I don't know why we use such terms, but whatever. All right, so has anyone ever really thought about RAM being called Random Access Memory? But in programming courses, everyone tells you that you should use memory contiguously and that's faster. Everyone's been told that before? Has anyone ever questioned that? Like, what the hell, it's called Random Access Memory — shouldn't it all be the same speed? Yeah. So why it's called Random Access Memory is because you can access it randomly and it's the same speed. So why the hell is everyone telling you to access things in order? No one's ever questioned that?
We all blindly believed that and we were like, yeah, sure, whatever. Does that make sense at all? If it's called Random Access Memory, why? Yeah. Yeah, but if you're accessing physical memory, randomly accessing any byte takes the same amount of time — it doesn't matter. And your C program knows where things are. When you run it, it knows; it just goes and gets it. So the reason why we want to access things contiguously is literally just because of the TLB and virtual memory translation. If you were just using physical memory, everything would be the same speed. So this has to do with the caches. So we can see this in a little example. There is this test-tlb repository that has some code in it to actually test the effect of the TLB, and it is written by the same guy that wrote the Linux kernel. So that's cool. So it's just a little program called test-tlb and it's really, really simple. It takes two arguments: a size, so it allocates a big contiguous chunk of memory of that size, and then there is the stride parameter, which accesses memory every that many bytes. So if I run it with these arguments, it means that I'm going to ask the kernel for a chunk of memory that is 4,096 bytes, and then I'm going to access every 4th byte, and I'm going to see how long each access takes. So if I go ahead and run that, I see that it takes like one nanosecond. Seems pretty fast. Well, what if — oops — what if I do something like this? So I ask for 512 megabytes of memory and then I access every 4,096th byte. So if I run that, I see that it is hella slow. Why is this hella slow? Yeah, yeah, because if I access every 4,096th byte, it means every memory access is on a new page, right?
And if every memory access is on a new page, then I have to walk the page tables for every single memory access, and that is why they tell you to — well, they should more accurately say, hey, try to access memory that is on the same page in order for things to be faster, because that's the reason. It's not just accessing contiguously — it's that contiguous accesses will more likely be on the same page, and your speed benefit comes from accessing everything on the same page. Yeah, yeah. So this is so much more than four times slower because, well, it turns out that a lot of other things happen behind the scenes — the kernel might actually have to allocate a page for that access, because it didn't do it ahead of time, things like that. It turns out the kernel actually has to do a lot of work whenever we're accessing new memory. But in this case, why the first run was so fast is because, well, we essentially just asked for a page. The first time we access a byte on a page, OK, sure, it has to go walk all the page tables and everything like that, so we would have a cache miss for our first access to a byte. But since everything else is on the same page, we would have essentially one miss, and then all the other accesses, so 1,023 of them, would all be hits. And then it would be really, really, really fast for the rest of the accesses. So we can see, if we do something kind of in between — so we get 16 megabytes of memory and then we access every 128th byte — then it's like 15; still a lot slower, but not as bad as the worst case. So again, when they told you to access things contiguously, they should have more accurately said: access things on the same page, so we don't have to walk the damn page tables, because that's slow. Yeah, what do you mean? Like this times 30 is this? All right, so how much faster is it — or slower? Yeah, it's like 32. Ah, that's more or less a coincidence, yeah.
Yeah, so the number of memory accesses — it's just this divided by this. So the second one's bigger, but the number we get is the time per access. So it'll figure it out. Yes, yeah, so this is the number per access. So accessing things on the same page: much, much, much, much, much, much faster. All right, so questions about that. Now we've figured out why some programs are slow, because if they use memory on different pages, it has to do all the translation, even though physical RAM itself is fast — that's why. So, other things. We saw the sbrk system call; that's for user space memory allocation. So that system call grows or shrinks your heap. The stack has a set limit. For growing, what the kernel does is usually really, really fast: it'll just grab pages from the free list to fulfill the request. The kernel can go ahead and set the PTE valid bit and some other permissions. And memory allocators use this heap space — malloc will use it. Memory allocators — more towards the end of the course we'll figure out how they work — are kind of difficult to implement, and rarely do you shrink that heap. So the heap might just stay claimed for the whole duration of the process, and the kernel can't free those pages. So some memory allocators will actually ask the kernel for memory directly using mmap to bring in large chunks of memory, and we'll play with that system call a bit in the next lecture, but that's basically the process asking for virtual memory directly. All right, so this one: if you go into the details of that xv6 kernel and actually look at how the kernel lays out the address space of a process, it looks a bit something like this. So you know how null is always zero, and if you dereference it, you get a null pointer exception? Well, that's because the kernel has set aside the zeroth page and will always mark it as invalid.
So you're not allowed to access it. So if you try to dereference null, that's just address zero. That's how it knows you seg faulted — it guarantees that that page is always invalid. So other things the kernel is responsible for when it loads your program into memory: it'll put your instructions somewhere, which will take up some pages. It'll put the data somewhere — that's your global memory — which will take up some pages. And then your stack will have a set size. In this case it grows downward; most stacks grow downward. And everyone's encountered a stack overflow before. So you might be asking, oh, well, how the hell does the kernel know that I overflowed my stack instead of just accessing some random old bytes? So how it figures that out: your stack has a limit. It has a set size, so it can only fill a certain number of pages. And what the kernel does is insert something called a guard page right at the end of the stack, marked as invalid. The kernel knows which page the guard page is. So if you overflow your stack a little bit, you'll be accessing that guard page, and the kernel will see: oh, you tried to access this particular page, so you likely overflowed your stack. So it'll tell you stack overflow instead of just a seg fault. If you really overflow your stack, you can jump right over the guard page and still get a seg fault — or, it's just memory, so you might jump and start reading some random global variables or some random instructions. Who knows — don't overflow your stack that badly. And then there's some heap, and then there are these two weird things which we don't really have to know for this course. It's more to explain, if you look at that xv6 kernel, the tricks it does to actually handle system calls. So it puts two pages into the process's memory. The first is called a trampoline.
It's a really weird term, but that's actually the code the kernel uses for the interrupt handler. So it takes the code for the interrupt handler and maps it to the same address in every single process's address space. So if you do a system call, that triggers an interrupt, and it actually runs code that has essentially been injected into your process. You're not allowed to modify this code — otherwise that would be bad — but through virtual memory, every single process can access it. And then it also has a trap frame that exists within the process's memory space, and that's where it writes all the state about the process. So the values of its registers and things it needs to save, it saves there, so it doesn't have to switch into the kernel's address space to do that. So, outside the scope of this course, but if you go into the details of the kernel, those are some of the tricks we have to do. Some other tricks the kernel can do — yeah. So what's the difference between heap and stack? So the stack is where all of your local variables are stored, and all the C arguments and everything, assuming we're talking about C. And the heap, that's where malloc and friends allocate memory — anything that needs to live longer than a function call. All right, so other tricks the kernel can do: it can actually provide fixed virtual addresses, and because system calls are slow, it can go ahead and map one of its pages into each process's user space, and then calls which access kernel data can be serviced without having to do real system calls. So for example, clock_gettime — that doesn't actually do a system call, because it's just a known virtual address that every process has access to, and any process can just read that address, and it is defined to be the current system time. So that's the one exception where you avoid doing a system call. And yeah, we're out of time. So just remember: boom for you, we're all in this together.