Welcome back, everybody. We are going to continue and finish up our discussion of demand paging a bit and then move on and talk about some IO. It's hard to believe we're already on lecture 17, but anyway, welcome to CS162. If you remember last time, we were talking about the notion of using the virtual memory system to build essentially a cache, which we call demand paging. And we came up with this effective access time, which looks very much like the average memory access time. And the key thing to note is this simple equation here, which basically says memory access time from DRAM, say 200 nanoseconds, page fault going to the disk, maybe eight milliseconds. And, keeping our units consistent of course, we built ourselves an effective access time. And what we see there is really this value of p here, that if one access out of a thousand causes a page fault, your effective access time goes up to 8.2 microseconds, which is a factor of 40 larger than the DRAM. So clearly one out of 1,000 is not a good idea. And notice I'm talking about DRAM here with 200 nanoseconds; if we were talking about cache, the baseline would be even faster and the slowdown even bigger. And so we can do this slightly differently. We can ask, well, if we want the slowdown to be less than 10%, then what do we have to have for a page fault rate? And we find that it can't be any larger than one page fault in 400,000. So this means that we really, really, really have to be careful not to have a page fault, if we can at all avoid it, which led us to basically considering our replacement policy as being very important to try to keep as much data that we need in the cache as possible. Now, we went through several policies last time and we talked about how LRU was a pretty good policy, but impossible to implement. And so we came up with this clock algorithm, if you remember. And the reason it's called the clock algorithm is because it looks like a clock. We basically take every DRAM page in the system and we link them together. So typically in an operating system like Linux or whatever, that means that every physical page or range of physical pages has a descriptor and those descriptors are linked together. And we have a clock hand, which says which page we're currently looking at. And we're gonna work our way through, and on every page fault, the clock algorithm says, what do we do? Well, we sort of take a look at the hardware use bit, which is usually in the page table entry. And if it's a one, it means that the page has been used recently. And if it's a zero, it means that it hasn't. And so what we're gonna do in general is we're gonna advance the hand and we're gonna take a look at the use bit. And if the use bit is zero, we're gonna assume that it's an old page and therefore we go ahead and reuse it. If it's a one, we know that it's been used recently. What do I mean by that? Well, if we see a one and can't reuse that page, we set that use bit to zero again. And then we go on to the next one and we keep repeating until we find one where the use bit is zero. And the key idea here then is that if we see something that's a one, it means that the page has been used since the last time we came around the loop. And so really what we said was, yes, this is not LRU, but it divides the pages into kind of two categories, one that is recent pages and one that is older pages, and we pick an old page. Now, there was a question: is the number of pages in the clock the total number of pages? And the answer is it's the total number of physical pages in the system.
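Just to spell out the arithmetic behind those numbers (200 ns DRAM access, 8 ms page fault service time):

\[
\begin{aligned}
\text{EAT} &= (1-p)\times 200\ \text{ns} + p \times 8{,}000{,}000\ \text{ns}\\
           &= 0.999 \times 200\ \text{ns} + 0.001 \times 8{,}000{,}000\ \text{ns} \approx 8.2\ \mu\text{s} \quad (p = 1/1000)\\
\text{and for at most a 10\% slowdown:}\quad & 200 + p \times 8{,}000{,}000 \le 220\ \text{ns} \;\Rightarrow\; p \le \tfrac{1}{400{,}000}.
\end{aligned}
\]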
Okay, now there was a question here about whether it's the number of pages in the page table. The reason that question isn't quite what you thought you were asking is that every process has a page table. So there are many page tables in the system and each of them points at parts of this. So what's in this clock is all of the physical pages, not the pages in any one page table, okay? There are many page tables. And the hardware does not set the use bit to zero, unlike what was suggested in the chat here. What happens is the hardware only goes from zero to one when the page has been touched. The operating system sets it to zero, and it sets it to zero when it's decided that it's not gonna recycle that page: it sets it to zero and moves the clock hand on to the next one, okay? So the operating system sets it to zero, the hardware sets it to one, okay? Are we clear, everybody? And the other thing we talked about last time, and you should go back and take a look, is how to emulate this bit. So the use bit, and the dirty bit which typically tells you that the page has been written, both of those can be emulated in software if you're willing to take more page faults, and I talked about that last time. All right, the other thing we talked about was the second chance algorithm, which has the same goal as the clock algorithm, which is: find me an old page. Notice how I said that, an old page, right? We're looking for an old page, not the oldest page. So the second chance algorithm has the same idea, and this was designed for VAX/VMS, where for various reasons the hardware didn't have a use bit. And so this was a different algorithm than clock, and the idea here is two groups of pages: the ones in green are mapped and ready to use; the ones in yellow are there, and they have their contents, but they're marked as invalid in the page tables, okay? And so now what happens is the ones in yellow are put together in an LRU list, the ones in green are handled FIFO, and what we do is the following. So these green pages are the only ones that we can actively access in hardware without doing anything. If we happen to touch a green page, we're good and we can go forward, okay? If we have a page fault, it would be because the page we're looking for is not in the green area. Now it might be in the yellow area, and if it's in the yellow area, what we're gonna do is we're gonna pull the page from the yellow area into the green area just by reassigning which category it's in and enabling the page table entry to allow it to be used. Otherwise we'll pull it off of the disk, okay? Now we could make a better approximation to LRU, as I was asked, by having multiple use bits. The problem is it's not really easy for the hardware to have multiple use bits, but as was also mentioned in the chat, you should take a look at the Nth chance clock algorithm, which gets you closest to LRU. So let's look at this one now. So basically what happens is full speed for the green ones. We get a page fault on the yellow ones, but we don't have to pull them off of disk, and last but not least are the pages that are on disk. And so if you notice what happens here is if we have a page fault, we take the top green page and we put it at the end of the LRU list, and now we have to pull the page that we're looking for into the green list. Now if we're lucky enough and it's in this second chance list, we can immediately pull it out of the middle of the second chance list, assign it to the green list on the end, and we're done and we can return and start executing.
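And to tie the clock algorithm mechanics together in one place, here's a minimal sketch of the eviction loop; the struct and field names are made up, not any particular kernel's code:

    /* Minimal sketch of the clock algorithm (names are illustrative). */
    struct frame {
        struct frame *next;   /* circular list of all physical frames        */
        int use;              /* use bit: hardware sets to 1 on any access   */
    };

    static struct frame *clock_hand;   /* where the hand is currently pointing */

    /* Find an "old" frame to recycle on a page fault. */
    struct frame *clock_evict(void) {
        for (;;) {
            if (!clock_hand->use) {            /* not touched since last sweep  */
                struct frame *victim = clock_hand;
                clock_hand = clock_hand->next;
                return victim;                 /* caller unmaps / writes it back */
            }
            clock_hand->use = 0;               /* OS clears the bit...           */
            clock_hand = clock_hand->next;     /* ...and gives it a second chance */
        }
    }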
And notice that the yellow again is being handled as an LRU list, because we put new pages on one side and we pull them out of the middle, and so we know that the one on the end at the very top here is the oldest in the yellow list. And so if the page is not in the yellow list, we have to pull it off of the disk, and so we pull it off of the disk and put it at the same spot in the green, and at that point, we're gonna throw out the oldest page from the yellow. And so this is now an approximation that gets us an old page to throw out, which is this top yellow one, and it sort of has the same purpose as the clock algorithm. And this was designed for an architecture, namely the VAX, that didn't have a use bit in hardware. All right? Okay, good. Great. So the other thing I kind of pointed out is the way we introduced the clock algorithm was that every time you had a page fault, you'd run the clock algorithm to find a page. Well, of course the problems with that are manifold, not the least of which is that it means that you can't actually start paging in off of the disk until you find a page to throw out. And the disk we know is gonna take a really long time, so we wanna get started as soon as absolutely possible. So instead of basically running the clock algorithm when we have a page fault, what we do is we just keep a free list. And the free list is some number of pages that are ready to be reused, and they're like the second chance list. Okay, so they're not mapped. I should really make these yellow I guess, but I like the red and green combination here, but call this a second chance list. And we have a daemon called the page-out daemon which works its way around trying to find enough free pages, or enough old pages, to put on the free list. And at the same time, we can also have ones that happen to be dirty; we can write them out to disk so that by the time we get to the head of the free list, the page is not dirty and is ready to be reused. And it's just like the VAX second chance list, except we have a clock for the active pages and a second chance list for the free list. And why do I say this? Well, if you happen to have a page fault that happens because of a page that's still in this free list, we can immediately put it back in the clock ring and reuse it. Okay, so a daemon is basically a kernel thread that's always running, is one way to look at that. So the operating system starts up some number of threads that are only running in the kernel and they don't have a user half. Or, something that's started up at boot time in the operating system and is running with root privileges and running all the time, that's typically called a daemon as well. All right, or call it a background process if you like. So now onto where we were at the very end of the lecture. So we were talking about this idea of a reverse page mapping. So think about it: a page table is a forward mapping. It basically says, for every virtual address, find me the mapping, and I can figure out, if there is a mapping, what the physical page is. The problem is that occasionally, if I wanna evict a physical page, and we've been talking about when you'd wanna do that, you have to figure out all of the page table entries, and really the page tables, that hold that page. And the reason this is tricky is because it's possible that for a given physical page there might be many processes that point at it. We talked about how when you fork processes you have a bunch of page tables that point to the same physical page, we've talked about shared memory, et cetera.
And so basically this is a reverse mapping mechanism that goes from a physical page to all of the page table entries that hold it. So it needs to be fast. We talked about that last time; there are several implementation options. One is you could actually have a table, a hash table or whatever, that goes from a physical page to the set of page tables or processes that hold that page. And that's fine. You can build that in software in the operating system; it's potentially a little expensive. Linux actually does this by grouping physical pages into regions, and it deals with a region at a time; since there's a smaller number of entries, that makes it a little faster. But the essential idea is to basically go from a physical page to the set of page table entries that hold that physical page. Okay, now on to what we haven't talked about. So how do we actually decide how page frames are gonna be allocated amongst different processes? So we have a physical amount of memory, I don't know, 16 gigabytes, whatever it is. And if you've got a modern cloud server, it might be terabytes these days. And the question is how do we divide that physical memory up among the different processes, so that, I don't know, is it for fairness, or what's the goal there? Well, we have many policies. This is a scheduling decision. So does every process get the same fraction of memory? If I have a hundred processes and I've got a hundred gigabytes, each gets a gigabyte. But maybe different processes have different fractions of memory that they need to actually run. If you happen to have a process that basically reuses the same page over and over again, giving it a hundred gigabytes of storage is not gonna be helpful. And it's wasteful. Somebody else might need that memory, okay? And maybe it's the case that we have so many processes running, and so much memory is needed, that we're spending all our time thrashing, and maybe we ought to actually swap whole processes out to give our machine time to run, okay? That's a desperation scenario, okay? Well, the other thing to keep in mind is that every process needs a minimum number of pages. And the way to think of that is you've clearly got a page where the current instruction pointer is; you want that one in memory, otherwise you won't be able to execute, and you want some number of DRAM pages that would basically be the ones that we're currently accessing. And if you don't have that, you're not gonna be able to make forward progress, okay? And for instance, on the IBM 370 you might actually need six pages to handle a single SS move instruction. So there was a question in the chat: won't we just figure this out dynamically? The answer is mostly yes, except there's a minimum number of pages, based on the architecture, just to guarantee forward progress of one instruction, okay? And it's not about full associativity in this case; it's about making sure, because remember we have hundreds of processes, that every given process has its minimum number so that when we get around to scheduling it, it actually can execute, okay? So when we're ready to replace a page, we have a couple of options. So what do we mean by replacing a page? It means we have a process that's trying to run, it needs a page that's not in memory, so where do we get the memory from? Now we can use the clock algorithm in a global sense, which is what we've kind of been talking about here, right?
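Going back to the reverse mapping for a second, here's a minimal sketch of the straightforward software version, a per-frame list of reverse entries; all of the structure and function names here are made up, this is not Linux's actual rmap code:

    struct rmap_entry {
        struct addr_space *as;       /* which process / page table maps this frame */
        unsigned long      vaddr;    /* at which virtual address                    */
        struct rmap_entry *next;
    };

    struct frame_desc {
        struct rmap_entry *mappers;  /* every PTE that points at this frame         */
    };

    /* To evict physical frame f: walk its mappers and invalidate every PTE.
     * clear_mapping() is a hypothetical helper that marks the PTE not-present
     * and shoots down the TLB entry. */
    void unmap_all(struct frame_desc *f) {
        for (struct rmap_entry *e = f->mappers; e != NULL; e = e->next)
            clear_mapping(e->as, e->vaddr);
    }

Okay, back to the question of where a replacement frame comes from.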
We have everything in the same clock algorithm and the same clock data structure, and the process just gets a replacement frame from the set of all frames, and whatever process loses a frame loses it, okay? So that is often done. That's a very common policy: basically all of the pages are in the same boat and they just get replaced using the clock algorithm. Another thing that you might imagine, which some operating systems do to be more fair, or if you have a real-time operating system maybe you do this to make sure you meet your real-time goals, is that each process selects from its own frames. So you assign physical memory to the processes, and then when a process runs out of memory and needs to page something in, it picks one of its own pages to put out, okay? So in that scenario you could have each process run its own clock algorithm to choose which of its own pages is an old one, and then we need some policy to decide how to divide the pages up, and maybe we dynamically choose a number of pages per process, okay? And that would be a local replacement policy with some policy for dividing the memory up, probably dynamically, okay? So let's look at a couple of options here. So one option is that every process gets the same amount of memory, and so this is a fixed scheme. So for instance, if you have 100 frames of physical memory and five processes, each gets 20 frames. Another might be a proportional allocation scheme, where the bigger process, the one that has the most virtual memory, gets more memory, and we could allocate this with some proportionality constant, right? So perhaps s_i is the size of process p_i in terms of total virtual pages on the disk. And so then what we do is we say, well, the allocation a_i is s_i over the sum of all the s_j, times the total number of frames m I've got, and that fraction goes to that process. Can anybody think about why, although this might sound good, this might not be a good plan, okay? We have malicious programs and abuse, but let's assume for a moment that this is not about maliciousness. Those are perfectly good answers. Yes, I like this next point here. So basically the size of the process is the size of the code, right? And so in that sense, if you take the binary and you link it and you look at the size of the binary on disk, that would be the size in this proportional allocation scheme. Why is that probably not indicative of the number of pages that this thing actually needs to execute properly? Anybody think of any good reasons? Okay, so everybody on the chat is basically getting the right idea. And the right idea is this: when you think about it, in today's programming we link in these huge libraries that have a lot of features to them, but we only use some of the features. And so the size of the code may have no reflection on the amount of code we're actually using at any given time. So you could have a really large process which is really only using a small amount of code, so this proportional allocation scheme wouldn't do the right thing, okay? Another thing obviously that you could do is a priority allocation scheme. So basically it's proportional, but with priorities rather than size. And so the higher priority processes get a choice of more pages to use, okay? And so the idea might be: if a process p_i generates a page fault, you select a replacement frame from the processes with lower priority. So, the question in the chat: somebody had said, oh, dynamic linking is a reason that this proportional allocation might not work.
And then the question is, why does that have something to do with it? And the answer is, well, when your program starts running and it dynamically links a bunch of libraries, we talked about that briefly, what you're doing is you're essentially attaching to libraries that are already in memory. And now all of a sudden you've got a much larger process, because you've linked in all of those libraries, right? And so that might contribute to what was considered your total size. And notice, by the way, that dynamic linking is not the only thing here; if we just statically link a large library, that'll increase our size as well, okay? So maybe the problem with these schemes is they're kind of fixed. They're trying to do something based on static properties of the process, and maybe it'd be better to do something more adaptive, okay? So what if some application just needs more memory and some other application doesn't need more memory? Maybe we ought to listen to that, okay? And how would we tell? What would be a clear sign that a process needs more memory? Anybody have an idea? Page faults, lots of page faults. What might be a clear sign that a process doesn't need as much memory as it's got? Okay, I see a bunch of people saying no page faults. Now you're never gonna get no page faults, but I would say few page faults, right? So the number of page faults is small relative to some process that really needs more memory, which has a high page fault rate. So we can see, relative to each other, that perhaps we could reallocate some of our memory, and it might be a better idea there, okay? And so the question might be, could we reduce capacity misses? Now if you remember the three Cs, right? Capacity misses are ones that happen because we don't have a big enough cache, or in the case of page faults, that process doesn't have access to enough memory. And so in this case, what we're gonna do is figure out how to dynamically assign, okay? And we could imagine that there's something like this, okay? So we have the number of physical frames we give to the process on the x-axis, the page fault rate on the y, and you could imagine a lower and an upper bound, which is where we wanna be. So not so low on the page fault rate that we're just using memory in a way that's not helpful, and certainly not so high, because then we're gonna be thrashing and not making progress, but maybe we wanna be in this narrow range here between lower and upper. And so as a result, if the page fault rate is above the upper bound, we know we really need more memory, and if it's below the lower bound, it means that maybe we could give up some of our memory and we wouldn't notice too much, okay? And so this is a specification for a policy to assign page frames, okay? Of course, what if we just plain don't have enough memory, so that we can't get anybody below the upper bound? Okay, so we don't have anybody below the lower bound to take pages from to help the ones above the upper bound. What do we do? Yeah, and then you cry, somebody said, right? Buy more, buy a better system, yep. Or maybe you swap out enough processes: you basically take a running process, you put it completely on disk, thereby freeing up memory so that the remaining ones can run fast enough, and then later pull the process back in off of disk and run it, okay?
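A rough sketch of that page-fault-frequency idea, with made-up thresholds and helper names (this is the policy described above, not any particular kernel's implementation):

    /* Run periodically for each process: compare its recent fault rate
     * against the two bounds and grow or shrink its frame allocation. */
    void pff_adjust(struct process *p) {
        double rate = (double) p->faults_this_interval / INTERVAL_SECONDS;

        if (rate > UPPER_BOUND)
            grant_frames(p, EXTRA_FRAMES);     /* faulting too much: needs more memory */
        else if (rate < LOWER_BOUND)
            reclaim_frames(p, EXTRA_FRAMES);   /* barely faulting: can give some back  */

        p->faults_this_interval = 0;           /* start the next measurement window    */
    }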
Because when you're in that region with a high fault rate, above the upper bound, what's happening is the overhead's so high you're not making progress, and you're doing a whole lot of swapping in and out, okay? And so the only thing your machine is doing is swapping, and it's doing it really well and it's doing it really rapidly, okay? Whereas if we take several processes and put them out on disk to sleep entirely, we free up memory, then we can get into this better region where we're more efficient, and we're actually gonna be running much faster on the remaining processes; we can complete them and then start pulling things back in, okay? So this is a situation where swapping can make a big difference. Now there was a question about how we set the lower and upper bound. So what's gonna happen there is, really based on previous experiments on your operating system, you can kind of figure out that things above the upper bound are really not making progress and things below the lower bound really don't need all of their pages. For the upper bound, if you look at the overhead of swapping, you can kind of figure out the break-even point where you're spending 50/50, half swapping, half regular execution; perhaps that's an upper bound, or somewhere in the middle here, that you don't wanna exceed, okay? So here the word frame, by the way, is the same as a physical page. Sorry if that's a confusing term there. Okay, so a frame is a physical page. All right. So thrashing is a situation where you just plain don't have enough pages. And yes, if you could somehow buy more memory, you might help. But in fact, if you take a look here on the x-axis of this graph, what I've got here is the number of threads that are simultaneously running. So I've got this labeled as degree of multiprogramming. This could be the number of processes, it could be the number of threads, that are all simultaneously running. And the interesting thing about this is that as you increase the number of threads, the fraction of CPU that you're using starts rising. So at some point, we have enough threads to keep the CPU busy. Can anybody tell me why adding more threads, even if you have only one CPU, might give you higher utilization of the CPU? Why does it even make sense that this goes up? Okay, so there was something here in the chat, there you go: somebody said less blocking on IO, correct. So the thing is that it's not that there's less IO. What's going on is even though we have threads that are blocked on IO, we have other threads to run, and so we're good to go. And so this is helping us overlap computation and IO. Now, at some point, you hit the thrashing point where the number of threads you've got is just way too high and you're doing nothing but overhead. And what you get is this precipitous loss of performance. So it's not just that this levels out, but that it just gets bad and everybody does poorly. And that's because you're spending all of your time going on and off of disk, and disk, of course, is extremely expensive, and thereby nobody is making any progress, okay? So thrashing is this situation where a process is busy swapping pages in and out with little or no progress, okay? So the question is, how do we detect it? What's the best response to thrashing? Well, clearly we would detect it by there being just a very high rate of IO going on, or excuse me, of paging going on. In fact, you could even detect it by comparing the amount of time you spend paging versus the amount of time you spend executing: far more paging, okay?
When you're in that situation, you're clearly thrashing. And the best response in that situation is really to basically stop some processes, put them out on disk, and let the other ones make forward progress, and you'll do much better, okay? The reason that more threads lead to more paging is because they're gonna have more unique memory requirements, and therefore you're gonna have a lot more paging. The other thing is, why does IO help us here? The answer is, if you have a single thread and it's doing bursts of IO followed by bursts of computation, then when it's doing the IO, it's getting zero CPU utilization. So you wanna make sure you have enough threads left over so that somebody can always be computing while the rest of them are sleeping on IO, okay? And the choice of which ones to swap out, that's a good policy question, all right? Maybe you pick the one that's got the most pages so the other ones can run, all right? Or, there are several different policies you can imagine there. So let's talk a little bit about the needs of an application, okay? So the needs of an application or a process or a thread are based on its memory accesses, okay? So, we looked at this a couple of lectures ago: if you take a look at the memory address space on the Y axis here and look at time on the X, what you see is that every vertical slice represents the set of pages, or the set of virtual addresses, that are actively in use, right? So we could scan across for any given point in time, a little window in time, and we can look at all the addresses that are in use, and that's actually our working set. So those are the pages that have to be in memory during that given time period in order to make forward progress, okay? Now, so one of the answers to what a process needs to make forward progress is that it needs to have its working set of pages in memory. And notice, by the way, let's back that up and watch that again. So if you were to look at any given time slice, what you'd see is the set of pages in that given time slice is different than the set of pages a little later, okay? So if you look here, this is a region where the memory addresses are in high use, but they're not in high use for the rest of this execution time. So only when we're in this region do we need those pages in memory, and so our working set's changing over time, and we wanna make sure at any given time that the total working sets of all the processes that are trying to run, or threads that are trying to run, can fit into memory. And if the total memory you need for the running threads is bigger than will fit in your physical DRAM, then you've got thrashing, okay? So the working set is the minimum set of pages you need. So if you don't have enough memory, then what? Well, better to swap out processes at that point, and for the policy of what to do, there are many policies you could come up with. The bottom line is trying to free up enough memory that things can make forward progress, okay? So here's a model of the working set, which roughly corresponds to this blue bar I showed you in this previous slide. So the blue bar says, if we take a look over a window of time of width delta and look at all of the addresses referenced in that range, that's the working set at that given time period, okay?
And so here the working set at time T1 is really: going back a delta period, what is the total set of pages that are in use? And I could write those in set notation; pages one, two, five, six, seven are in use, and those are the pages that need to be in memory, okay? If you look at this other time, T2, then you see that there's a different set of pages, three and four, okay? Now, the working set window is a fixed number of page references; for instance, it might be the last 10,000 instructions that defines a working set, and those are the pages that have to be in memory in order to make forward progress. And so this is actually a model, and you can imagine that if delta's too small, it's not really encompassing what I need to run, okay? And if it's too large, it's not gonna line up with the different phases in the program. So if delta is too big, and that would correspond to this blue bar being too wide, then I would mistakenly think that I need all of those pages as well as all of these other ones. And so it needs to be narrow enough to reflect the changing patterns of the working set over time, okay? And of course, if delta is infinity, then you're encompassing the entire program, and this isn't really a useful model other than to say, well, here's all the addresses that the program uses, right? That doesn't have enough of a time component to be helpful, okay? So this is a good question in the chat: won't we end up giving a process a lot of memory as its working set changes? So the answer is really that, if you look at the clock algorithm, what happens is that it dynamically adapts. So as the working set changes, what really happens is the old pages aren't the active ones and I bring in new ones. If I wanna be more sophisticated about what's going on here, and I see a changing working set, then what I'm really saying is I'm never gonna have more pages than fit in that, say, 10,000-instruction window. And if I'm really gonna build a paging scheme based on that, then as I go through, what really happens is I sort of say, oh gee, those pages I had before I don't need any more, but I need these new ones, and you could let those old pages be used by some other process that's getting some new ones, okay? So the page faults, this is kind of averaging over time; as you move forward, the page faults aren't gonna come any faster than they would otherwise just because of this model. This is really trying to model what pages we need to have in core to make progress. And if you were to add up all the working sets for all of the running processes, then you get an idea of how much total memory you need, how many total frames, and that gives you an idea whether you're in a thrashing situation, because that demand D is greater than the total memory you've got, okay? So the policy, roughly, is: if the demand D is greater than M, then you suspend or swap out processes until you can make forward progress. And here the word swap, when I say swap out a process, means put the whole thing out on disk and free up its physical pages so that other things can use those physical pages. Now, M here is total memory, okay? So M is what I've got available for my memory; it's the amount of DRAM. Now, let's talk a little bit about compulsory misses. So compulsory misses are misses that occur the first time you ever see something. This might be the first time you ever touch a page, or after the process has been swapped out and you swap it back in, all right?
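Here is that D greater than M policy from a moment ago, written out as a rough sketch; working_set_size, pick_victim, and swap_out are made-up helper names, not real kernel calls:

    /* D = sum of working set sizes; M = total physical frames available. */
    void working_set_admission(void) {
        size_t D = 0;
        for (struct process *p = runnable_list; p != NULL; p = p->next)
            D += working_set_size(p);          /* pages referenced in the last delta */

        while (D > M && runnable_list != NULL) {
            struct process *victim = pick_victim();   /* policy choice: which one sleeps */
            D -= working_set_size(victim);
            swap_out(victim);                  /* push it to disk, free its frames      */
        }
    }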
That swap-in could be a source of compulsory misses, after a phase where you've pushed the thing out. So the question here is, are demanded frames basically page faults? Right, if we're doing demand paging, what we're saying is we bring a page in as a result of a page fault. So demand paging is the same as pulling something in dynamically as soon as it's needed. The reason for looking at the working set as we've done is, one, to give us a better idea of how many pages we really need, but, two, it can actually lead to slightly more intelligent paging in, okay? So we could do something called clustering, which some operating systems do, which says on a page fault, what you do is you bring in multiple pages around the faulting page. That's a form of prefetching. And since the efficiency of disk reads increases with sequential reads, which we'll show you as soon as we get to disks, it makes sense maybe to read several pages at a time rather than just the one that you faulted on. So that's a way, on a demand paging miss, to pull in slightly more pages than we're asked for, as a way of trying to optimize our page faults and lower our compulsory misses, okay? The other is actually to do real working set tracking, which is to try to have an algorithm that figures out what the current working set is for a given process. And when you swap the process out and then bring it back in, maybe you just swap in the working set as a way to get started, and thereby avoid the compulsory misses, okay? Now let's look a little bit at what Linux does. So memory management in Linux is a lot more complicated than what we've been describing, of course, but it is interesting to take a look at what they've settled on. So among other things, Linux has a layout that tracks some of the history of the x86 processor. And so Linux actually has at least three zones. It has the DMA zone, which is memory below the 16 megabyte mark. Originally these were the only places where DMA worked well, on the old ISA bus. I'll say more about DMA in a couple of slides, but this is direct memory access. There's a normal zone, which was everything from 16 megabytes to 896 megabytes. And this is all mapped up at 0xC0000000 for the kernel; I'll show you that in a moment. And then there's high memory, which was everything else, okay? Every zone has its own free list and two LRU lists, which is kind of like they each have their own clock, okay? There are many different types of allocators, okay? In homework four, you've been looking at ways of making malloc and so on. Well, if you look inside the kernel, there are several different allocators. So there are things called slab allocators, per-page allocators, mapped and unmapped allocators. There's a lot of interesting stuff there. There are many different types of allocated memory. So some of it's called anonymous, which means it's not backed by a file at all. Some of it's backed by a file. So once we get to talking about file systems a little more, we'll look at some of these uses of memory. There are some priorities to the allocation: is blocking allowed? So if you remember, we talked about how things like interrupts aren't allowed to go to sleep, because the interrupt has to be short, okay? Well, blocking, that is going to sleep, might or might not be allowed in your memory allocator. So if you imagine you have a kernel malloc, one of the things you need to tell it is: if you don't have the memory I'm asking for, are you allowed to put me to sleep or not?
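In Linux this choice shows up as the flags you pass to the kernel allocator. GFP_KERNEL and GFP_ATOMIC are the real flags; the surrounding functions here are just illustrative:

    #include <linux/slab.h>

    void *grab_buffer_in_process_context(void)
    {
        /* Process context: the allocator is allowed to block (sleep)
         * until memory becomes available. */
        return kmalloc(4096, GFP_KERNEL);
    }

    void *grab_buffer_in_interrupt_context(void)
    {
        /* Interrupt context: sleeping would be fatal, so ask for a
         * non-blocking allocation and be prepared for it to fail. */
        void *buf = kmalloc(4096, GFP_ATOMIC);
        if (!buf) {
            /* handle the failure without blocking */
        }
        return buf;
    }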
If you're in an interrupt handler, the answer's gotta be no, because if it puts you to sleep you basically crash the machine. On the other hand, if you're coming in from a process, maybe getting put to sleep's okay. So that's the difference between blocking or not blocking, and the allocators inside the Linux kernel have to make that distinction. Okay, so here's a couple of interesting things I wanna show you. So this is pre-meltdown. I'll say a little bit more about meltdown in a second, but up until a couple of years ago a 32-bit address space basically looked like this. So there were three gigabytes for the user and another gigabyte for the kernel. And what this is, is the kernel would map not only its kernel memory, but every physical page up to 896 megabytes was also mapped up here. Okay, and then the user space had up to three gigabytes of virtual memory that it was allowed to use. Now what's interesting about this is of course what's in red isn't available to users. So if users try to use this, you get a page fault and ultimately a core dump. But as soon as you went from user to kernel, like by a system call, these addresses are already mapped in the page table and they're ready to use. So all of the kernel code is up there, all of the interrupt handlers, all that stuff, and every page in the system is up there. All of that's available for immediate use as soon as you go into the kernel. Then you get to 64-bit memory, which is considerably bigger. So notice that there we only had 32 bits of virtual address; here we have 64 bits of virtual address. It has a similar layout, but basically 64 bits gives you a lot of memory. So much memory that nobody has that much DRAM yet. And so you not only don't have that much DRAM, you don't really have that much virtual memory even. And so what happens there is, even though in principle you could map any virtual address to any physical address, what happens in real processors is there's actually what's called the canonical hole in the middle, okay? And that really reflects the fact that the page table only works up to, say, 48 bits of virtual address. And notice the idea here is that from all zeros up to 47 ones gives you the user addresses, and then at the top of the space, from all ones down to the address ending in 47 zeros, gives you the kernel addresses, and everything in between is basically not assignable. So any attempt to touch that part of the virtual space would cause a page fault, okay? And so this layout really reflects the fact that you don't even have all 64 bits worth of virtual addresses. Now, somebody kind of joked in the chat there that yeah, we don't yet have 64 bits worth of physical memory, but yeah, someday that will probably happen. There are already people talking about 128-bit processors; I mean, those exist. So I don't know, things keep getting larger. Okay, so let's look a little bit more at what we had here, okay? Now, if you look again, what's great about this arrangement is that every physical page is available up in this kernel space, and all of the kernel code and everything is available up in this space. And so it really makes it easy for the kernel, because it can touch any page, it can touch any of its code, and it can basically manage those pages easily, okay? And one of the things is that, in general, those red regions are just not available to the user.
There's a couple of special dynamically linked shared objects that are available to the user, and those are moved around randomly. Every physical page has a page structure in the kernel; they're linked together into the clock and they're accessible in those red regions. For 32-bit architectures, as long as you have less than 896 megabytes, then every page not only was in some user's page table, but it was also available in that red region up there for the kernel to touch. So it actually had double mappings, okay? And then for 64-bit virtual memory architectures, pretty much all the physical memory is mapped above that 0xFFFF8... range, okay? So this 896 megabyte number comes from having enough space up in that red region to map 896 megabytes but leave some extra space for the kernel and for a few other specialized addresses, okay? So needless to say the kernel's only got a gigabyte up there, so you can't map four gigabytes into one gigabyte; that wouldn't work. And it turns out 896 megabytes is the max you can map above 0xC0000000, because that's just the way Linux does it. So meltdown happened, okay? So what was meltdown? Meltdown, let's go back to this map. So sometime in 2017, 2018, basically the computer architecture community was shocked by something called meltdown, and what it was, was a way that was demonstrated for user code to read out data that happened to be mapped but was supposed to be invisible, kernel-only, okay? So even though these page table entries were marked as kernel only, the fact that they were in the page table at all, even though they were marked as unreadable, meant that using the meltdown code you could read data out of them. And it was actually demonstrated that you could, with user code, read all of the data out of the kernel, which means that secret keys and all that sort of stuff was all vulnerable, okay? Which, as you can imagine, was not a great thing for people, right? And so the idea here is using speculative execution. Now what you gotta realize is modern processors take a bunch of instructions and they execute them out of order as a way to make everything fast, okay? And so they run them out of order and they even allow things to run ahead and do executions that aren't allowed. And the reason that's okay is because any problems are eventually discovered and all the results are squashed, and it just works, okay? So if you're really interested in this, I highly recommend you take 152. It's a lot of fun to learn about why this out-of-order execution works. But the key thing here to keep in mind is, yes, things are executed out of order and they're executed in parallel and what have you, and they're allowed to temporarily do things incorrectly, but when all is said and done it's all cleaned up at the end, so the registers never reflect incorrect execution or violation of kernel privileges or anything. Nobody in the computer architecture community really thought that this was gonna be possible, okay? And what they didn't realize was that you could do something like this, where you set up the cache, okay? You have an array in user mode, that's why it's green. It's got 256 entries times 4K apiece, which is a page size, and you flush the whole array out of the cache. So all of these cache entries for the array are now gone, and then what you do is the following code, and I just wanna give you a rough idea. You say, I'm gonna try something; this is not quite C but it's close. I'm going to try to read a kernel address that I'm not supposed to, okay?
So it's up in that red region and I'm gonna try it, okay? And then I'm gonna take the result that I read out of it and I'm gonna use that to try to read out of this array which I have access to, okay? So I'm only gonna get one byte out of the kernel. I'm gonna use it to access something in the array, and then if I get an error, which of course I'm gonna get an error because I'm reading kernel memory, it gets caught and it's ignored, okay? And why does this do something? Well, this does something because the processor is running all of this stuff ahead in its pipeline. It goes ahead, it does the read early. It accesses the cache early, and then it says, oh, you weren't supposed to do that, and it squashes all the results, so the registers don't have anything in them, but I have touched the cache. And now the cache has got an entry in it depending on what value I read back. So one of 256 cache lines is now in the cache. And so then all I have to do is scan through and find the one that's actually cached and fast, as opposed to all the other ones that go to memory, and voila, I just read eight bits out of the kernel, okay? And this was shocking, okay? What this did was it took the out-of-order execution, which is there for performance, and it suddenly gave you the ability to read stuff out of the kernel that you weren't supposed to touch, okay? Questions? It takes a little getting used to, but it's astonishing that this is possible, okay? And let me just say this again. The idea here is I try to read a byte out of the kernel, which I'm not supposed to. The processor is heavily pipelined, so it goes ahead and reads it anyway. I use that result to index the array, which does a read that pulls one of 256 possible cache lines in, and all of this stuff gets squashed because the processor says, oh, that's not something you're allowed to do. But the damage has already been done, because I've already pulled something into the cache, and as a result, one out of 256 entries in the cache has a value in it, and I can figure out which one through timing, by just saying, oh, that one cache entry is fast, the others are slow, okay? And as a result, you can work your way through and read out all of memory. So this is bad, okay? And in particular, it's bad because of all of the kernel address maps that everybody had all of these years, with kernel-mapped stuff up in the upper portion; I just showed you all of that red up there, right? Okay, this type of layout had been around forever, extremely convenient, because basically the page table has got everything in it, but it's only when you go into the kernel that these kernel addresses are allowed to be used. Suddenly you couldn't do that anymore, because it opened you up to the meltdown bug. And so post-meltdown, there's a whole bunch of patches that came in that basically involve no longer having one page table, but really having two for every process: one that's used in the kernel and one that's used for the process. And that meant that you had to flush the TLB on every system call, okay, in order to avoid the bug, except on processors that actually had a tag in the TLB that would tag entries based on which page table you're using. And only versions of Linux after 4.14 were able to use that, the PCID. So this really slowed everything down, okay? And the fix would be better hardware that kind of gets rid of these timing side channels, and there have been fixes kind of on the way for a while and they're starting to get better.
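To put the gadget being described into code form, here's a schematic sketch only; it mirrors the lecture's description, it is not a working exploit, and flush_each_page and the fault-catching setup are hand-waved, made-up pieces:

    #include <stdint.h>

    uint8_t probe[256 * 4096];        /* user array: one page per possible byte value */

    void meltdown_read_one_byte(volatile uint8_t *kernel_addr) {
        flush_each_page(probe, 256);  /* hypothetical: clflush every probe page       */

        /* The next two lines will fault, but the CPU runs them speculatively first: */
        uint8_t secret = *kernel_addr;        /* privileged read: results get squashed */
        (void) probe[secret * 4096];          /* ...but one probe page lands in cache  */

        /* The fault is caught and ignored (signal handler or similar); the registers
         * are rolled back, but the cache state is not.  Now time accesses to
         * probe[i * 4096] for i = 0..255: the single fast index is the secret byte. */
    }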
So the reason the processor does what we're talking about here is really to speed everything up, because you want as much pipelining as possible, and the checking of the conditions takes a lot of time, just like the access. So it starts the accesses early, okay? And it's mostly fixed, okay? It's mostly fixed, but it's still a little bit surprising that this was possible at all, okay? Okay, yes, you are understanding this correctly. Okay. All right, so let's switch gears a little bit. But anyway, the reason I wanted to bring this up is, A, it's an interesting bit of very recent history. And B, it actually changes what memory maps are allowed now, and if you're wondering why things are not as clean as they used to be, it's partially due to the post-meltdown memory map. Okay, so now we're gonna switch gears and we're gonna talk about IO. And if you remember, we've talked a lot about the computer and data paths and processors and memory. We really haven't talked about this input-output issue. Yeah, Pintos is potentially vulnerable to this problem, but Pintos is not a commercial operating system. So why is IO even interesting? And the answer is really that without IO, a processor is just like a disembodied brain that's busy just computing stuff. And of course, we all know that all processors aspire to computing the last digit of pi, but presumably it'd be nice if we were able to get the answer out. Okay, and so what about IO? Now, there was a question: is IO in scope? So, general IO, basically I think I said everything up to today was potentially in scope. So without IO, computers are useless. And the problem, though, is that there's so much IO, right? There's thousands of different devices, there's different types of buses. So what do we do? How do we standardize the interfaces on these devices? And the thing is that devices are unreliable; media failures and transmission errors happen. And so the moment we put IO in here, our carefully crafted virtual machine view of the world suddenly gets very messy. And we need to figure out how to standardize enough of the interfaces across all these different devices so that we can hope to program this. Okay, so how do we make them reliable? You know, cause there are lots of different failures. How do we deal with the fact that the timing is off? They're unpredictable, they're slow. How do we manage them if we don't know what they'll do or when they'll do it? Okay, all of these different things. And really, philosophically, I like to think of this as the fact that the world, which is what IO touches, is really very complicated. And computer scientists like to think in simple ways and nice abstractions. And when the nice abstractions collide with the real world, you get problems. Okay, the fake news shows up, right? And so we gotta figure out what to do about this. And so if you remember, we kind of said, what is IO? Well, IO is all of these buses, it's the networks, it's the displays, and we somehow have this nice clean virtual memory abstraction of processes and stuff, virtual machine abstraction, excuse me, above the red line. And storage: we have to access the binaries, we have to access our networks across that protection boundary. And all of the IO is both below that kernel protection boundary and potentially out in the real world. And hopefully the OS is gonna give us some sort of common services in the form of IO that we can then access without caring so much about the exact precise details of the world.
And the other thing is of course the Jeff Dean range of time scales, where a cache reference might be 0.5 nanoseconds, all the way up to the time to send a packet from California to the Netherlands and back, which might be 150 milliseconds; there's a big range. And so whatever we do, it's likely that we're gonna need a whole range of techniques to deal with all of these different time scales, okay? Now, let's go and think about this a little bit more. If you look at device rates varying over 12 orders of magnitude, here are the Sun Enterprise buses, and these are all different devices that are actually on those buses. The system has to be able to handle this wide range. So you don't wanna have high overhead for the really high speed networks or you're gonna lose packets, but you don't wanna waste a lot of time waiting for that next keystroke, which is gonna take a long time. So in a picture, what do we have? We have our processor, which we've been focusing on pretty exclusively. Say this is a multi-core machine where each core has registers, an L1 cache and an L2 cache. And then those cores share an L3 cache, okay? And that's our processor. And then we've gotta deal with the IO out here. And what you can see is the IO devices are supported by IO controllers, for instance, here. And those IO controllers provide some standardized facilities to talk with the outside world. And then there's various wires and so on that communicate, okay? And these interfaces are the things we need to figure out how to make work, okay? And, you know, for instance, if you were gonna pull something off of an SSD, you're gonna put commands into the IO controller, which is then gonna reach out across a standardized bus and start the read off the SSD, which will pull the data through DMA into DRAM. And then you can read and write it once it's in DRAM. And so there's a lot of different interesting pieces here that we're gonna have to figure out, okay? So, that's a good question in the chat: does DMA write to physical addresses? I'm gonna say yes for now, although there are virtual DMA protocols that can write into virtual memory as well, but usually you pin it into physical memory before you start DMA. Okay, so here's another look at a modern system. So you've got the processor with its cache, and then you've got various bridges to PCI buses, for instance, and then maybe you have a SCSI controller that talks to a bunch of disks, or maybe you have a graphics controller which talks to monitors, or maybe you have an IDE controller which talks to slower disks, et cetera. And really, all of these different buses are part of the IO subsystem as well, okay? So what's a bus? It's a common set of wires for communicating among hardware devices, and there are protocols that have to be satisfied on these wires. So operations or transactions include things like reading and writing of data; control lines, address lines and data lines have to be part of this bus. So it's typically a bunch of wires, okay? And you have many devices that might be on a bus, right? So this is a standard abstraction for how to plug and play a bunch of individual things onto a common bus that then can get to your processor, okay? And so there are protocols. There's an initiator that starts the request. There's an arbitrator which says it's your turn to actually talk. There may be handshaking to make sure that no data is gone before you can grab it, so the communication is only as fast as permissible.
There's also arbitration to make sure that two speakers don't try to speak at the same time, et cetera, okay? Now, the closer we are to the processor, typically the wires are very short and we can get very high speed communication. The farther away from the processor, the wires are longer or you go through more gateways, and the communication gets a lot slower. So things that need to be really fast are typically close to the processor; things that maybe need to be more flexible are often further away but slower. So why do we have a bus? Well, buses, in principle at least, let you connect N devices over a single set of wires. So buses came up over the long history of computers as a way of allowing us the maximum flexibility to plug in many devices, okay? Now of course you end up with N squared relationships between different devices on that bus, which can get messy very quickly. The other thing is that there are several downsides to a bus. So one is that you can only have one thing happening on a bus at a time, and that's because everybody has to listen, okay? And that's where the arbitration part comes into play. The other downside, which I'm gonna point out here before we leave the bus, is the longer the wires, the larger the capacitance, and the slower the bus is, because capacitance takes a long time to drive up and down. I don't know if you guys talked about that in 61C, but basically if you have a really long bus and a lot of capacitance, it means that to change a wire from a zero to a one, you have to charge it up, and the more capacitance, the longer that takes, okay? So buses that get too long get slow. So that kind of explains part of what I'm about to say next, which is: here's an example of the PCI bus. You've probably taken a look inside one of your computers. You can plug a card in; it's got many parallel wires representing 32 bits of communication or what have you, a bunch of control wires, a bunch of clocking wires. And this is a parallel bus, because all of the different card slots are all connected together with a common set of wires, okay? And so where I showed an arrow back here, each one of these slices might have another one of those connectors on it that would connect across tens or hundreds of wires in that bus, okay? And so not only is there a lot of capacitance in this, but the bus speed gets set by the slowest device. So if you have a device on here that responds very slowly, then everybody suffers, okay? And so what happened is we went from the PCI bus to, for instance, PCI Express and some of these others, in which it's no longer a parallel set of wires but rather a bunch of serial links that tie everything together and act like a bus, but it's really a bunch of point-to-point connections, okay? It's really a collection of very fast serial channels. Devices can use as many lanes as they need to give you the bandwidth, and then slow devices don't have to share with the fast ones. And so therefore you get the expandability of something like a bus, but the speed of a dedicated point-to-point set of wires between each device, okay? And one of the successes of the device abstractions in Linux, for instance, is that going from the original parallel PCI bus to PCI Express really only had to be reflected at some of the very lowest device driver levels; most of the higher levels of the operating system never even had to know the type of device. So that's a good example of abstraction coming into play here to help deal with the messiness of the real world. So here's an example of a PCI architecture.
You know, you have your CPU, and you've got a very short memory bus to RAM. So these are typically a bunch of what are called single inline or dual inline memory modules, and they're connected on a bus with very short wires directly to the CPU, okay? And so that can be blazingly fast. And then the CPU typically has bridges to a set of PCI buses, and these are serial communications, and plugged into the PCI bus, for instance, would be a special bridge to the original Industry Standard Architecture bus, the ISA bus, which was on the original IBM PC. What happens in a modern system is you fake it by having a fast PCI Express bus, but the ISA controller can talk to legacy devices like old keyboards and mice and so on, okay? And also you might have bridges between different PCI buses, and these days you typically have a USB controller, where USB is actually a different type of serial bus that has a set of root hubs and regular hubs, and here there's a webcam, a keyboard, a mouse. Those can be plugged into USB, which is plugged into PCI, which is plugged into the CPU. Okay, and then you can also have disks and so on. So this is a view of the complexity of the bus structures, but all of this gets hidden behind proper device drivers, so that the higher levels of the kernel don't have to worry about some of this complexity, only the lower levels, okay? The question is, is this parallel or serial? The answer is yes, okay? Now, basically, when I say PCI: PCI Express is serial, the original PCI bus is parallel, and it depends a lot on what part of the system we're talking about, but basically the serial communication of PCI Express is far more prevalent than the parallel buses these days, and it's gonna depend on your exact system. You can open up some of the specs that describe your computers and see kind of what the buses are internally, okay? Now, how does the processor talk to a device? So I wanted to start our conversation here a little bit about what it is inside the operating system that talks to devices, and we already talked about how the CPU might have a memory bus to regular memory, okay? And so that's a set of wires that typically the hardware knows how to deal with directly, okay? On the memory bus, or possibly directly connected to parts of the CPU, we'll talk about that in a little bit, might be a set of adapters, okay? And those adapters give you other buses, and we're not gonna worry exactly what the buses are here, but what I wanted to show you is that typically the CPU is trying to talk to a device controller, this big thing in magenta, and that device controller is the thing that has all of the smarts to deal with a specific device. It gets plugged into the right bus interfaces in a way that the CPU can send commands to that device controller and read things from the device controller, okay? And some of that communication might be via reads and writes of a special sort, I'll show you this in a moment, that basically go across the memory bus, or across a bus to the device controller, and set registers that control its operation, or pull data, or start DMA. We'll talk about that in a moment. Also coming out of this are typically interrupts that go to the interrupt controller. Now we already had the discussion about interrupt controllers earlier in the term, but one of the ways that the device controller typically says that it needs service, or that something has been completed, is over an interrupt, okay?
So the CPU interacts with the controller, which typically contains a set of registers that can be read and written. So what I've got here for the registers are ones that potentially allow you to read and write things about the device, maybe set some commands; like for instance, if this is a display, maybe one of the things you might write to this second register is the resolution, okay? Now, the question in the chat is, is the device controller the same as the device driver? No, this is hardware. The device driver is software running on the CPU, and the device driver knows how to talk to the device controller hardware, okay? So the device controller, this is actually hardware, okay? And if you look here, for instance, we might have a set of registers that have port IDs on them. I'll show you what that means in a moment. But for instance, port 20 might be the first register, port 21 might be the second, port 22 might be a control register, and port 23 might be status. And by reading and writing those ports, I could change the resolution of the device. The other thing is I can read and write addresses, okay? And reading and writing addresses allows me to potentially write bits directly on the screen, okay? So there are two different types of access that are typically talked about between the processor and the device controller. One is port-mapped IO, where the CPU uses special in and out instructions that address ports in the controller, okay? And those are these special register names. And the other is memory-mapped IO, where just by reading and writing certain parts of the address space, I cause things to happen on my device. And so I wanna talk about port-mapped IO and memory-mapped IO. Port-mapped IO typically only shows up on things like the x86 processor or very specialized processors that have IO instructions. Memory-mapped IO is much more common, where you can read and write special memory addresses and it just goes to the controller, okay? Now, the region here is: what region of the physical address space can I read and write that's gonna cause things to happen on the device? I'll show you that in a second. Now, here's an example. If you were to go into devices/speaker.c in Pintos, you'd actually see something that turns the speaker on at a frequency and off again. And what it says here is it's gonna do some stuff and talk to hardware. And the thing I wanted to point out is these outb instructions, okay? There's special code for that which compiles down so that inside of this routine you actually see an assembly instruction called outb. And that outb is an IO instruction that writes to an IO port that's gonna touch the speaker, okay? And there's also a corresponding inb, which is another instruction. So these are actually native instructions for the x86 processor that take a port number and some data and access that IO device. And these port numbers typically are 16 bits, or they can be 32 bits under some circumstances, but it's a small address space for IO, okay?
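Here's a minimal sketch of what that port-mapped path looks like from C, just to make it concrete. The outb and inb wrappers below are the standard GCC-style inline assembly for the x86 in/out instructions (Pintos's own wrappers boil down to essentially this); the port numbers and the "display controller" they talk to are just the hypothetical example from the slide, not a real device.

    #include <stdint.h>

    /* Write one byte to an x86 IO port (the "outb" instruction). */
    static inline void outb(uint16_t port, uint8_t data) {
      asm volatile ("outb %0, %1" : : "a" (data), "Nd" (port));
    }

    /* Read one byte from an x86 IO port (the "inb" instruction). */
    static inline uint8_t inb(uint16_t port) {
      uint8_t data;
      asm volatile ("inb %1, %0" : "=a" (data) : "Nd" (port));
      return data;
    }

    /* Hypothetical ports for the display-controller example on the slide. */
    #define DISPLAY_DATA0   20   /* first data register   */
    #define DISPLAY_DATA1   21   /* second data register  */
    #define DISPLAY_CONTROL 22   /* control register      */
    #define DISPLAY_STATUS  23   /* status register       */

    /* Hypothetical sequence: ask the controller to change resolution, then
       poll its status register until it says it's done. */
    void display_set_resolution(uint8_t mode) {
      outb(DISPLAY_DATA0, mode);
      outb(DISPLAY_CONTROL, 0x01);          /* made-up "set resolution" command */
      while (inb(DISPLAY_STATUS) & 0x80)    /* made-up "busy" bit               */
        continue;
    }

The key thing to notice is that these are not loads and stores to memory; they compile to the dedicated in/out instructions, and the 16-bit port number travels in a separate IO address space.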
Memory mapping is a little different idea, okay? For memory mapping, this is our physical address space, and keep in mind there are obviously gonna be big regions of that physical address space that have DRAM in them. But when you have a device plugged into the system, you can have regions of the address space that actually talk to that device directly. So if I happen to have reads or writes to this part of the physical address space, what I'm gonna do is write commands into a graphics command queue, which might, for instance, cause triangles to be drawn on the screen if I'm doing some cool three-dimensional rendering, okay? Or if I read and write this region of memory, I might actually put dots on the screen. And then there's another region, which might be commands and status results, where just by reading and writing the addresses in that region, I get back status or I cause commands to happen. So in the example here, if I want to put dots on the screen, I just write to display memory, and I can cause characters to show up there by writing the right dots, right? Or I can write graphics descriptors, I mentioned this could be a set of triangles, and then I hit a command register and that will cause them to be drawn, okay? Now, are these addresses hard coded? In the really old days these addresses were hard coded. Now what happens, depending on what bus this is on, like the PCI Express bus, is that these addresses are actually negotiated at boot time by the boot driver. This is not in the regular Pintos code; this would be the boot driver negotiating with the hardware over the PCI Express bus to decide which physical addresses go to which parts of the hardware. And the reason this auto-negotiation is so good is that if you plug a bunch of devices in, they negotiate so that there are non-overlapping addresses, whereas once upon a time you actually had to set jumpers and stuff on cards before you dared to plug them in, so that you didn't have overlapping addresses for your different devices, okay? All right, questions about memory mapping versus port mapping? There's a good question there. So the good question on the chat is: is the data getting written to memory and then the device controller reads it, or does writing to these addresses just send it directly to the device? It's the latter, okay? So you don't put it into DRAM and then have it go into the controller. The act of writing doesn't go to DRAM; it goes to the actual controller, okay? Now, the question here is why wouldn't they use virtual addressing to solve the negotiating? The problem is you need an actual physical address on the bus, and then you can virtually map to it. So if your physical addresses overlap, then you've got a problem. Think of it like the DRAM we've been talking about as our physical memory space: if we had different DRAM cells that mapped to the same physical address, chaos would ensue, right? So we've got to make sure that the physical addresses dealt with by the cards are all unique from each other. And once we've got that, then you can map parts of the virtual address space to these physical things. And then you can give command of a device to a user-level process, for instance, just by setting up its page tables the right way to point at these physical addresses. But you need to make sure that the physical addresses don't overlap first, okay? Now there's a good question of what is faster, port mapping or memory mapping? The answer is that the memory-mapped options are usually a lot faster under most circumstances. The mechanism of using ports is kind of a legacy mechanism; you often use it only to access old-school devices or ones that are part of the IBM PC spec, okay? And the reason is really that mapping through memory is so much more flexible.
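To make the memory-mapped side concrete too, here's a minimal sketch in C. The addresses, register layout, and "draw triangle" command are made up to match the display example on the slide, and in a real kernel you'd first map these physical addresses into your virtual address space (uncached) rather than casting raw integers like this.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical physical addresses for the display example. On a real
       system these would be assigned by PCI Express negotiation at boot,
       and the kernel would map them into virtual space with caching off. */
    #define FRAMEBUF_BASE   0xA0000000u   /* writing here puts dots on screen  */
    #define CMD_QUEUE_BASE  0xA1000000u   /* writing here queues graphics cmds */
    #define STATUS_REG      0xA2000000u   /* reading here gives device status  */

    /* 'volatile' matters: every load and store must really reach the device,
       in order, because these addresses are the device, not DRAM. */
    static volatile uint32_t *const framebuf  = (volatile uint32_t *) FRAMEBUF_BASE;
    static volatile uint32_t *const cmd_queue = (volatile uint32_t *) CMD_QUEUE_BASE;
    static volatile uint32_t *const status    = (volatile uint32_t *) STATUS_REG;

    /* Put one pixel on the screen just by storing into the mapped region. */
    void put_pixel(size_t x, size_t y, size_t width, uint32_t color) {
      framebuf[y * width + x] = color;
    }

    /* Queue a made-up "draw triangle" command, then poll until it's done. */
    void draw_triangle(uint32_t v0, uint32_t v1, uint32_t v2) {
      cmd_queue[0] = 0x1;        /* hypothetical opcode: draw triangle */
      cmd_queue[1] = v0;
      cmd_queue[2] = v1;
      cmd_queue[3] = v2;
      while (*status & 0x1)      /* hypothetical "busy" bit */
        continue;
    }

Notice there are no special instructions here at all; ordinary stores and loads do the work, which is exactly why this style is so flexible.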
Memory-mapped IO is a path that's been set up for large addresses, and you can actually tell the cache to ignore certain addresses. So if you look carefully at the page table entries, I don't have them up today, but look at the slide from last time, you'll see there are a couple of bits in a page table mapping that say not to put the data in the cache. And you want that because you wanna make sure that all writes go straight through to the hardware, and that when you read, the data isn't cached, so that you don't accidentally get old data. You want your read to always go directly from the hardware into the processor, okay? Good, any other questions? So the question is why was I saying there might be overlapping physical addresses? Imagine simply putting two of these display controllers into the same machine, okay? If we hard coded which physical addresses were for that card, we'd now have an overlap, okay? And so that overlap needs to be removed, and that's part of the negotiation process for modern buses like PCI Express and so on. Now, the question about ports: ports are actually a completely separate address space from the regular physical address space. And so the ports go via a separate path, if you will. The data is all the same, but the addressing bits say something different. They say this is not part of normal addresses, this is part of the port-mapped space, all right? Good. And yes, you can protect this with address translation. And where do these usually get mapped in virtual memory? It depends on how they're being used. So if you're not giving the user the ability to touch a device, and you have to be very careful about doing that, then it's gonna be mapped into a part of the physical address space that doesn't have DRAM in it. And if you take a look at the typical Linux memory maps, there are gonna be some spots, often in very low memory, for IO, and high memory is another possibility too. But really it's gonna depend a lot on the actual hardware that you've got, and where there is DRAM and where there is not. You need this to be in the places where there's no DRAM. And each of the buses, like PCI Express and all the others, they all have their own spaces that they map into as well. Okay, so I think the right answer to that question is really that you don't need to worry about exactly where in physical space it is, just that it gets mapped in physical space and that at boot time we make sure it doesn't overlap with anything else mapped in that same space.
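Going back to those "don't put it in the cache" bits for a second, here's a minimal sketch of building a page table entry for a memory-mapped IO page on x86 with 32-bit paging. The helper function is hypothetical, but PCD (page cache disable) and PWT (write-through) are the real flag bits in that entry.

    #include <stdint.h>

    /* A few x86 page-table-entry flag bits (32-bit paging). */
    #define PTE_P    0x001u   /* present                  */
    #define PTE_W    0x002u   /* writable                 */
    #define PTE_PWT  0x008u   /* page-level write-through */
    #define PTE_PCD  0x010u   /* page-level cache disable */

    /* Hypothetical helper: build a PTE that maps one page-aligned physical
       MMIO address with caching disabled, so every load and store actually
       reaches the device instead of hitting stale data in the cache. */
    static inline uint32_t make_mmio_pte(uint32_t phys_addr) {
      return (phys_addr & ~0xFFFu) | PTE_P | PTE_W | PTE_PCD | PTE_PWT;
    }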
Okay, so there's more than just the CPU; I wanted to say a little bit about this. So this is, for instance, Skylake. I've talked a little about Skylake, but it's got multiple cores, up to a couple dozen in there on the big server parts, okay? And there's typically an on-die interconnect; it might be a ring, it might be a mesh, there are a lot of different options. Each core has a processor in it, okay? The processor might do out-of-order execution, remember Meltdown, we just talked about that. It might have a bunch of special operations to deal with security and so on. But that's just the processor. If you look at everything else here, we've got the system agent, which basically talks to various DRAM controllers; that's the IMC. We can also talk to other chips to give you cache coherence, okay. And then there's also a GPU in this particular chip, down here, the processor graphics, which can actually draw on the screen and so on if you don't have a separate GPU in your system. And so there are a lot of different pieces in here that are more than just the processor; that's kind of my point. The processors are very interesting, but all of this other stuff: the system agent gives you DRAM, gives you display controllers, and the processor graphics gives you graphics. And then there's integrated IO on most modern chips from Intel, okay, so that's the memory controller and PCI Express for graphics cards. So you see, coming off the chip here, there are typically very fast PCI Express lanes up at the top for external graphics. There's also built-in graphics, which is lower performance. So PCI Express comes directly off the same chip, okay. You know, in the old days you had the processor, you had other stuff, then you had some buses and so on; here the PCI Express signals are actually coming directly out of the chip. And there's this direct media interface to the platform controller hub that you see there on the slide. That typically connects to a lot of other IO, okay. So here is an example where we have the processor, and notice this is another view: we've got PCI Express, we've got DRAM, that's the DDR, we've got embedded displays and so on. And then the platform controller hub down here handles pretty much everything else that's interesting, okay. All right, so the thing to really learn from this particular slide is that the IO is tightly integrated and that there's a lot of really interesting IO coming off of this, okay. So the platform controller hub is this chip: lots of IO, okay, USB, Ethernet, Thunderbolt 3, BIOS, and this LPC interface for legacy things like keyboards and mice and so on, okay. You don't need to know all of these details, but this is trying to give you a flavor for some of the interesting things we have to control, okay. So we're gonna finish up here pretty soon, but I wanted to cover a couple more things before we're totally done. When you start talking about IO, and we're gonna go into this in much more detail in a couple of days, you start talking about things like, well, do I typically read a byte at a time or do I read a block at a time? So some devices like keyboards, mice, et cetera give you one byte at a time, okay. Things like disks give you a block; it might be 4K bytes, it might be 16K bytes at a time. Networks, et cetera, tend to give you big chunks. We might also wonder, not just byte versus block, but are we reading something sequentially or are we randomly going places? For some devices, you know, tape is the obvious case, you have to do sequential access, right. Others can give you random access, like disks or CDs, okay. And in those cases there's some overhead to starting the transfer, but then you can often pull the data out in large chunks once you've gotten to that random spot. Some devices have to be monitored continuously in case they go away and come back. Some generate interrupts when they need service, okay. Transfer mechanisms like programmed IO and DMA, we're gonna talk more about those next time, okay. These are different ways to get the data in and out of the device. I showed you the topology earlier with the CPU talking to the controller, but now we've got: how do we actually get the data in and out? Do we do it one byte at a time in a loop, or do we ask for big chunks of data to go out automatically? That's gonna be something we talk about, okay. And so really, I think I'm gonna save this discussion for next time.
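Before we wrap up, just to give you a concrete feel for that byte-versus-block distinction from the software side, here's a minimal sketch using the ordinary POSIX read call. The device paths are only examples, and the actual transfer mechanics underneath (programmed IO versus DMA) are exactly what we'll pick up next time.

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>

    /* Character-style device: data shows up a byte (or a few) at a time,
       so you read small amounts as they arrive. Path is just an example. */
    void read_character_device(void) {
      int fd = open("/dev/ttyS0", O_RDONLY);
      char c;
      while (read(fd, &c, 1) == 1) {
        /* handle one byte ... */
      }
      close(fd);
    }

    /* Block-style device: transfers happen in fixed-size blocks, so you
       ask for a whole 4 KiB chunk at a time. Path is just an example. */
    void read_block_device(void) {
      int fd = open("/dev/sda", O_RDONLY);
      uint8_t block[4096];
      ssize_t n = read(fd, block, sizeof block);
      (void) n;   /* number of bytes actually transferred */
      close(fd);
    }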
So in conclusion, we've talked about lots of different IO device types today. There are many different speeds and many different access patterns: block devices, character devices, network devices, and different access timings like blocking, non-blocking, and asynchronous. We'll talk more about that next time. We talked about IO controllers; that's the hardware that controls the device. We talked about processor access through IO instructions or through loads and stores to special memory. As you know, there are various notification mechanisms like interrupts and polling. We'll talk a lot more about polling next time, but you're very familiar with interrupts, okay. And all of this is tied together with device drivers that interface to the IO devices. So the device drivers talk to the controllers, and the device drivers know all the idiosyncrasies of the controllers and how to make them work. And then the device drivers, as we've discussed in the past, provide a really clean interface upward, okay: a clean open, read, write interface. They're gonna allow you to manipulate devices through programmed IO or DMA or interrupts. And there are gonna be three types of devices we'll talk about: block devices, character devices, and network devices. And so I think I'm gonna let you go. I hope to see you in a couple of days; we're gonna have some interesting stuff about devices to talk about next time. I hope you have a good rest of your Monday, and I hope there weren't too many of you under the threat of power outages. I know that there are parts of Orinda, Lafayette, and Moraga on the other side of the hills that all have their power out. And some of you have been evacuated; that's even worse. I'm sorry to hear that, and I hope you get back to your living situation soon. Have a great evening, and we will talk to you tomorrow. I mean, excuse me, talk to you on Wednesday.