OK, let's get going. So the topic today is kernel and user mode programming and IO systems, which naturally go together. And before we get into that, we're going to have a micro quiz like last time. So let's think about paging again. We saw inverted page tables right at the end of our discussion of paging. So let's see. Inverted page tables: does the page table size grow with virtual memory allocation? False, good, OK. Yep, they're sized by physical memory, so that doesn't happen. What about, in the implementation we described anyway, does the IPT get slower when physical memory is heavily allocated? True, yeah, why is that? Yeah, it's using a hash table, so as it fills up you're going to get some collisions and some chaining in order to get to the right result. Good, OK.

Yeah, increasing the number of frames with LRU replacement gives the same or lower miss rate. Is that true? What? True, yes, OK, why is it true? Do you remember the name for the class of algorithms for which this holds? They're called stack algorithms. So MIN and LRU are both examples of that kind of algorithm. The set of pages they have in memory is always a subset of what it would be if you had more frames, so you're guaranteed that increasing the frames doesn't cause more misses. All right, same question for second chance, which is an approximation of LRU: does increasing the number of frames for second chance necessarily give the same or lower miss rate? No, OK, good. So that one, although it's trying to be an approximation, doesn't have that property. It exhibits Belady's anomaly. OK, the clock algorithm requires the operating system to keep track of page accesses as well as faults, yes or no? Yeah, all right, good. So everything we talked about except FIFO looks at accesses of pages, and so requires the system to track page accesses, not just faults.

All right, so the last topic to talk about in the context of paging is working sets and thrashing behavior. Then we're going to talk about the kernel and dual-mode execution and programming, and then we'll talk about the IO subsystem and a very high-level overview of device drivers.

OK, thrashing. Everybody has an intuitive idea of thrashing because it sounds bad; it sounds like the computer is angry or frustrated. Thrashing is often caused by an overload of processes: basically, multiple processes are competing for the same physical pages. And necessarily, if you have too many processes asking for more physical memory than you have, then some of their pages need to get swapped out periodically as they're being scheduled. So that leads to a very high fault rate, and it leads to this sort of discontinuous trajectory of CPU use versus number of processes running. You nicely increase your CPU use, especially assuming you have multiple cores; you're getting more work out of the computer as you increase the number of processes running, up to the point where you start running out of physical memory. Then the CPU is wasted: it starts just waiting for page faults to be processed, and the operating system spends a lot of its time doing that.

All right, so two problems: how can we detect thrashing, and how can we respond to it? Well, first of all, how do we detect thrashing? Yeah, page fault frequency. Although, as we've defined it, it's a little bit more than the page fault frequency. For instance, what if one process is using a lot of memory? So we've got to look at per-process fault behavior too.
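Here's a minimal sketch of that kind of per-process check. This is illustrative only: the threshold and the function name are made up, not from any real kernel.

```c
/* Hypothetical page-fault-frequency check for one process.
   PFF_HIGH is an arbitrary illustrative threshold, not a real kernel constant. */
#include <stdbool.h>

#define PFF_HIGH 0.01  /* faults per reference above which we suspect trouble */

bool faulting_too_often(unsigned long faults, unsigned long refs) {
    /* A process faulting this often probably has too few frames; if most
       processes look like this at once, the system is thrashing. */
    return refs > 0 && (double)faults / (double)refs > PFF_HIGH;
}
```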
All right, so the assumption behind a lot of virtual memory is that there's locality: that accesses are local in time and in memory location. So here's a pattern from a particular process running over time. Time is along this axis, and these are page numbers in virtual memory; the gray area shows accesses to those pages at those times. The locality assumption we've made so far is that access is localized in space and time. In other words, if it were localized in time, there'd be a big block of activity in one column; if it were localized in space, there'd be strong rows. And you can probably see from this that there's both row and column structure. The locality in space is the row structure, and the locality in time is the column structure. So yes, we're seeing some good spatial and temporal locality here.

All right, and here's a cursor sweeping through so you can see snapshots of locality. And you can see the locality in a certain sense is even more coherent within a time slice. So roughly speaking, in this time slice here, you have pretty much the same set of pages being accessed, modulo a few stragglers outside. So that suggests an explicit model of those sets of pages, which is called the working set model.

OK. So the idea is, if we can efficiently handle and fit this working set, then we'll avoid swapping except possibly when the working set changes. And that's reasonable behavior: if this window is on the order of seconds, or a significant fraction of a second, then swapping on that time scale is not bad. We want to avoid swapping at the millisecond time scale, because that's how long it takes to actually access a page. OK. So if we look across processes, and we have several of them running at a given time, and we don't have enough memory for the working sets of all those processes, then that's the bad kind of thrashing that we want to avoid. So one way to handle that is to swap out one of the hungrier processes, or at a minimum swap out some of them, so that the total working set that's running in memory fits within the physical memory of the computer.

So, working sets. You can look at a timeline of page references like this. Here are some numbered pages, and here's a time window related to the time frame within which we'd like to avoid a lot of swapping. It might be on the order of thousands of accesses; it depends on the speed of the machine, but it's something big enough that it will both catch the coherence of the page accesses and avoid too-rapid swapping. There's a working set per process; it's normally defined per process. In terms of thrashing, though, thrashing is a behavior of the whole computer, so it's going to depend on the union of the working sets of all of the processes that are running. And the options we're considering are whether or not to swap out a process, so that means adding or subtracting a particular working set from the total pages allocated at that time. So on the first timeline here, this is the working set: these are the pages being accessed in that window. And here is a very coherent time interval, where we're only accessing pages 3 and 4, OK? All right, so that window is delta, and it's defined as a given number of page references; the book defines it in terms of references.
There are arguments for defining it in terms of a time window instead, because that's what actually affects the paging behavior, but one way or another it's defined in terms of either references or absolute time. Let's say it's 10,000 accesses. Then we can define a working set for process i, WSi, as the total set of pages referenced in the most recent window of that length. So the idea is to have it map well onto the picture we looked at: we want to catch a window within which the set of pages is reasonably fixed, and capture the entire window within which it's fixed. So about the size of that pink line. If it's smaller than the pink line, then we're missing some of the temporal coherence; we might be missing pages that are really being used as part of the same block of execution. And if the picture is a bit more complicated than what we saw, there might be pages you can't quite see from the picture that are being accessed within that window, and we want to make sure we catch all of them. Which one, this picture here? Well, again, working sets are normally defined per process, because there's no reason that two different processes that aren't even synchronized would have a meaningful joint working set. They're not coordinated in any way, so it only really makes sense to think about one process. So we don't want to make the window too small, because there may be pages we're not seeing that are really part of the same locality. If it's too big, then the code may have switched to a different part of its execution and be using different pages, so it's not really coherent. And if we make it really big, we're just catching all of the page references, which brings no value; it just says the process can access anything it's ever going to access.

All right, so we can define the total demand for frames, D, and here's where we sum across processes. Although the processes are working with virtual memory addresses, this is how much physical memory we would need to allocate to serve all those processes in a given time slice. So what we care about is whether D is bigger than physical memory, because if it is, then we have thrashing. And if we want to reduce D below physical memory, we have a sum over i, so if we can find a process to swap out such that D drops below the physical memory limit, then we've eliminated thrashing, at least temporarily. So a common policy is: if D gets bigger than physical memory, swap out a process. That assumes the operating system is actually measuring these things, but the simple heuristics we've looked at, which track pages accessed over time, would do that. All right, so there's a simple solution to the problem, and you'd probably want to penalize the processes that are the most greedy ones, yeah. Well, we didn't say we'd stop it, we said we'd swap it out. We don't have to know what's running in it, we just look at its page accesses. Well, all right, it's not going to be the only thing you look at in scheduling. The processes probably still have a priority, they may have an estimated time to completion, and you can look at the past and see how long the processes have been running.
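To pin down the definitions we've been circling around, here's the textbook version in symbols, with $\Delta$ the window and $m$ the number of physical frames:

$$WS_i(t, \Delta) = \{\, p \mid \text{process } i \text{ referenced page } p \text{ during } (t - \Delta,\ t\,] \,\}$$

$$D(t) = \sum_i |WS_i(t, \Delta)|$$

If $D(t) > m$, the working sets don't fit and we thrash; the policy above picks some process $j$ and swaps it out, reducing $D$ by $|WS_j|$, and repeats until $D(t) \le m$.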
Coming back to the question: I didn't say I'd only look at that. I said it's generally good to swap out greedy processes, all other things being equal, but you have a lot of parameters you can use, and you would use all of them. The thing is, unless a process is using a lot of memory, there's not much advantage to swapping it out. Generally, a lot of the time processes have kind of power-law behavior, so it's the most greedy one that's going to dominate the paging behavior, and not swapping that one out may just not work. Or you might have to swap out four or five other processes to free as much room as the one greedy process would. So yeah, it's a complicated equation to figure out, but some version of that should work.

Okay, so we talked a little bit last time about compulsory misses, which are misses that happen when you first bring a program into memory, and you'll have a lot of pages being touched for the first time. Also when processes are swapped out: we talked about demand paging, and you don't normally swap a whole lot of pages out at once, but once a process goes out, other processes are generally going to grab the pages the old process had. So when you swap a process back in, it's often going to have to move a lot of stuff into memory, assuming you don't have thrashing and it's been swapped out for a while before coming back in.

So there are a couple of options for that. One of them we talked about very briefly last time, which is clustering. That's the idea that accesses to virtual memory are generally spatially localized; in that first graph, you saw a lot of horizontal rows around similar addresses, because you have big objects in memory. So one heuristic is to bring in blocks of virtual memory: even on a single fault, there's a good chance that nearby pages are going to be used, so you can bring in a bunch of them. Clearly that will be faster if you believe you're going to access those other pages soon afterwards, because doing a single disk access to get all of that stuff is quicker. And the obvious problem is that in order to bring in extra pages, you've got to swap out pages. So you're overriding your existing replacement policy, whether it's LRU or an approximation, and pushing out things that are possibly still useful, so you have a higher risk of causing page faults from what you swapped out. But this does seem to be used in Linux, and a lot of operating systems also support larger page sizes, because of the observation that a lot of the time memory is being accessed in blocks larger than little 4K pages. So clustering is one of the techniques that's supported fairly commonly.

Another approach that has been tried a few times is working set tracking, which is essentially recording, from previous executions of a process, what its working set was. When it gets swapped out, you record the pages that were part of its working set, and when you bring it back in, you try to bring in those same pages. So it's an appealing method, and I know people have tried it, but I couldn't find examples of this actually working better than, say, LRU or the Linux approximation of LRU. So it's a good idea; it may just need a bit more intelligence in the modeling of the working set, but generally speaking, LRU tends to do better than this. In other words, global page swapping does better than trying to be smart about the pages of an individual process and modeling that.
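As an aside, applications can feed this kind of prefetch hint to the kernel themselves. Here's a minimal sketch using the standard madvise call; the wrapper name is mine, and the kernel is always free to ignore the hint.

```c
/* Sketch: hint that a range of mapped memory will be needed soon, so the
   kernel can read those pages in as a cluster rather than one fault at a time. */
#include <sys/mman.h>

void hint_prefetch(void *addr, size_t len) {
    madvise(addr, len, MADV_WILLNEED);  /* advisory only; may be ignored */
}
```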
All right. So, just to quickly review our model of address translation for multiple processes: typically, processes have some segments, typically code, data, heap, and stack, possibly others if you break the code up into different segments. They're in a virtual address space. There's a translation map for each process, which might consist of hierarchical page tables or inverted page tables. That maps to physical memory, usually in some non-contiguous way, and the same goes for all the other processes. And the operating system has its own kernel pages down there.

So, dual-mode operation is about setting safety boundaries between what the kernel can do and what user code can do. So for instance, is it a good idea for applications to modify their own translation maps or page table entries? Well, the answer's already there: obviously not, because then you could access arbitrary physical memory. You'd be able to set up access to pages that the operating system hasn't assigned you, which can often be pages that belong to other processes.

All right. So that has to be implemented somehow, and virtually all processors these days have at least two modes: a kernel mode that's understood by the hardware, and a user mode. The bits are set in special control registers, and the kernel is the only code that can set those bits. It's often a little more complicated than that: on Intel processors there are actually four levels of protection, although because most code is written for user-or-kernel level protection, almost all OS kernels for Intel hardware use only the kernel mode and the user mode. But there are intermediate modes there that can be used.

So the idea of user mode is to protect other processes. You can think of it as a sort of padded cell with no sharp objects, or perhaps a kindergarten might be a nicer metaphor. But anyway, user code can't change the mode in order to access critical resources that the kernel has access to. It shouldn't modify any of the memory translation maps, and it shouldn't have direct access to memory except through the translation maps. All right, what other things should be protected? Yeah, good: scheduling. There's some communication there between the kernel and user code, so user code gets to specify priorities, but user code doesn't get to specify exactly when it runs; the scheduler does that. And the user code can provide a variety of hints to the kernel; it might also provide hints about what kinds of pages it will access, but the kernel is always free to ignore those. All right, anything else? Well, typically the critical resources like the display, disks, and so on need to be operated in a certain way. Some devices don't care so much, but anything that has access to low-level hardware layers can potentially break the machine, if not literally then figuratively: you could end up with a blue screen, or a keyboard that no longer functions, which gives you no easy way to recover. So the IO resources are pretty important too. All right, so we have to share execution between the user and the kernel.
And in particular, the user often wants the kernel to do certain things for them, and the kernel wants to communicate information to the user. So we have transitions from kernel mode to user mode, which is basically getting into the safer configuration, and also from user mode back to the kernel.

One thing I should say about this metaphor: it's sort of the right picture in that, within the padded cell of user mode, there are a lot of potentially dangerous objects, but they're all toy objects; they don't really break things. You can ask a device driver or a device to do something which would be dangerous, but the kernel prevents that from happening. In kernel mode, on the other hand, there are a lot of genuinely dangerous things you can use, and for that reason the kernel is typically a lot more restricted in the kinds of operations it allows you to perform. We'll see that a little later. But it's important to keep in mind that there's a parallel set of protections that exist in kernel mode, beyond preventing people from doing things. Okay, we'll get to that; I'll be a bit clearer later on.

All right, so how does the kernel create a process? That's the very first step in running a user process, and this is sort of a culmination of ideas we've seen before. You need to allocate and initialize the process control block, which the scheduler is going to use to schedule and run the process. Read the program instructions off disk, along with any constants that are part of the program, and store them in memory. You need to set up a memory translation map, and that translation map at minimum needs to point to the code to be executed, plus any constants it might need; the rest should happen automatically by demand paging. Other instructions or memory locations that the program accesses will cause page faults, which cause those pages to be swapped in. To run the program, you have to set appropriate machine registers, probably as for a function call to the main program. Set the hardware address-translation pointer to the translation table. Importantly, set the status word to user mode, so it really is going to execute in user mode and the hardware will prevent a whole bunch of things from happening. And finally, jump to the program. All right, so this is the new piece here: setting the control mode for user execution.

Okay, and now switching is more or less the context switching we've seen before. You've got to save and restore the state of critical registers, and simply swap the hardware pointer. The nice thing about virtual memory and memory mapping is that the maps themselves are in pageable memory, so you don't need to move them when you do a context switch. You just need to change the pointer that points to the top-level map, and possibly flush TLB entries: depending on whether the TLB entries carry process IDs or not, you either have to flush the whole table or maybe just a few entries. So that should be enough to get you into the context of the new process.
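Pulling those steps together, here's a minimal sketch of the last few steps in C-like pseudocode. All the names here (pcb_t, set_page_table_base, and so on) are hypothetical, not any real kernel's API, and on real hardware the last two steps typically happen atomically via a single return-from-trap instruction.

```c
/* Hypothetical sketch: final steps of launching a user process. */
void start_user_process(pcb_t *p) {
    load_registers(&p->initial_regs);    /* stack pointer, argument registers */
    set_page_table_base(p->page_table);  /* hardware address-translation pointer */
    set_status_word(USER_MODE);          /* drop to user privileges */
    jump_to(p->entry_point);             /* start running user code; no return */
}
```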
All right, and going back the other way, from user mode to the kernel. Here it's a little more delicate, because we don't just want the user program to call arbitrary kernel code; that would be like the padded cell with lots of open windows. So instead, how does user code get into the kernel? Everyone's programmed this by now, yeah. Yeah, so you have system calls, which are very restricted. They're restricted in number and type, and also in how you actually invoke them: you use an integer identifier, not even an address into the kernel. So a user process executes a call to a system call, and that causes a trap instruction, which does put you into kernel mode, but on the other hand it will only let you start execution at one of the addresses registered as a system call entry. So it's pretty safe. You then execute that specific system call, and if you gave it strange arguments, it should abort with an error message without doing any real damage, and return back to you.

Okay, so the system call is a voluntary procedure to take you into the kernel in a controlled way that's about as safe as it could be. Because of the limitation we just described, you can't call arbitrary kernel routines; you have this fixed set, and each of those should have some error checking in it, so even if you hand it garbage as arguments, it shouldn't do anything bad. So that index, and the syscall vectors managed by the operating system, mostly prevent bad things from happening. And the standard is pretty consistent across versions of Unix, and therefore the Mac operating system now; the POSIX standard, which is best known for standardizing threads, actually standardizes most of the C headers as well, which implies in particular that system calls follow certain conventions.

All right. And this is going to be really familiar because it's in project one, so we'll keep going. Argument passing, though, I don't think you've seen yet; this will be even more visible in project two, both the way system calls are made at the assembler level and how the arguments are passed. Most of the time, to save memory accesses, system calls, like normal calls, use registers for the first four arguments. It's very desirable for system calls not to require more than that number, because if they do, the remaining arguments have to sit in memory and then be copied into kernel memory. But anyway, normally it's just a few arguments, and most syscall functions satisfy that constraint.

Okay, and a system call is a trap. It's like an interrupt, although it's caused by software, and it's called a synchronous exception because it occurs at a specific point in the user code and is synchronized with the execution of the code; it's somewhat repeatable as you run the program. So all of the usual execution exceptions, like divide by zero, illegal instructions, bus errors, or segmentation faults, cause traps, specific types of trap, into the operating system. Page faults do too, although they don't signal an error; they just cause the operating system to respond and fetch the page. Those software-generated exceptions contrast with asynchronous exceptions, which are the normal hardware interrupts: things generated by timers, or IO devices becoming ready for an operation or finishing one. And you can disable interrupts by masking them; traps, in general, you can't.
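Before summarizing, here's roughly what the register-passing looks like at the assembler level, as a preview of project two. This is a minimal sketch of what sits underneath a libc wrapper on x86-64 Linux: the syscall number goes in rax, the first arguments in rdi, rsi, rdx, and the syscall instruction is the trap. It bypasses libc purely for illustration; real code should just call write().

```c
/* Sketch: invoking the Linux write system call directly on x86-64.
   rax = syscall number (1 = SYS_write); rdi, rsi, rdx = first three args.
   The kernel clobbers rcx and r11 as part of the syscall mechanism. */
long raw_write(int fd, const void *buf, unsigned long len) {
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
    return ret;  /* negative values encode -errno */
}
```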
Okay, so to summarize the last couple of slides: on a system call, exception, or interrupt, the hardware enters kernel mode with interrupts disabled, which helps the kernel stay in a sane state while it's dealing with the current exception. It saves the program counter and jumps to a handler in the kernel, much as we just saw for a system call. And depending on the processor architecture, either the hardware saves the registers, or the software is responsible for that.

Okay, so let's quickly look at the architecture of the IO system now. We've talked about user and kernel mode so far, and one of the most important uses of that boundary is to let us have device drivers, which a lot of the time have to be managed by the kernel, but which also provide a lot of useful functionality to the user. So device drivers cross that boundary somehow. So a PC these days has a CPU, perhaps several of them, and some kind of bridge to memory. These days, most of the time, the main bus that everything else hangs off is a PCI bus, and in fact most computers have several PCI buses at the top level. The disk controllers reach the CPU through that main PCI bus: SCSI controllers, which are more common on servers, and SATA disks, which are more common on end-user PCs. The other types of buses, like USB and so on, again sit under the PCI bus. And then, under SATA (it used to be IDE), there'll be one or more disk buses. The graphics controller these days sits directly on a PCI bus. And outside of that core, there are all of the physical devices you might want to attach to a PC. And those devices between them have an enormous, probably seven or eight orders of magnitude, range of needs in terms of bandwidth and, in some cases, memory capacity.

All right, so what's the role of IO? Well, to do interesting things, other than just raw computation, you want IO devices. But as we just saw, we have potentially thousands of devices with different needs, and rather than having thousands of different individual pieces of code to deal with them, we want to standardize on a relatively small number of device drivers, or at least classes of device drivers. We also have to worry about the reliability of devices: they can fail, and certainly communication can fail, and it doesn't have to be any problem with the PC you're working on; anything on the network, or at the other endpoint of a connection, can be causing trouble. And devices are unpredictable and often slow, and they seem to get slower over time for interesting reasons.

So, all right, if we look at the structure of a kernel to manage this sort of thing, there's a sort of pure kernel, something like the microkernel in Mach, that does memory allocation and inter-process communication. And then there's an IO system that typically contains, again, communication infrastructure, synchronous response to interrupts, and a few other resources that are shared across the actual device classes; we'll see some of those in a minute. Those include things like APIs for the PCI bus or for USB. And then sitting under those are specific drivers for specific devices. Now, you can see there's a sort of boundary here between hardware and software. We're not going to worry about the hardware right now. What I did want to talk about briefly is this architecture of the kernel with the device drivers loaded. So, obviously, when you buy an operating system or buy a PC, you don't know what devices might be connected. So, when drivers are loaded, there is a fairly significant contract that has to be made between the device driver and the kernel. In Linux, that's a module load operation.
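As a concrete taste of that contract, here's the skeleton of a minimal Linux character-device module. The details (names like demo_read) are illustrative, but register_chrdev and the file_operations table are the real mechanism: the table is exactly the set of callbacks through which the kernel will call back into the driver.

```c
#include <linux/module.h>
#include <linux/fs.h>

/* Callback the kernel will invoke when someone reads our device. */
static ssize_t demo_read(struct file *f, char __user *buf,
                         size_t n, loff_t *off) {
    return 0;  /* always EOF; a real driver would copy_to_user() device data */
}

/* The two-way contract: the kernel calls back into the driver through here. */
static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
};

static int major;

static int __init demo_init(void) {
    major = register_chrdev(0, "demo", &demo_fops);  /* link into the kernel */
    return (major < 0) ? major : 0;
}

static void __exit demo_exit(void) {
    unregister_chrdev(major, "demo");  /* unlink BEFORE the code disappears */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```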
The load is something like dynamically linking that code, but it's quite a bit more, because as well as giving the device driver code the ability to call into kernel functions, there's a whole bunch of callbacks linked from the kernel back into the device driver (a callback table like the one in the sketch above is an example), so that when things happen, say through the PCI bus or through the USB bus, the kernel IO system finds out about it and can call the appropriate device driver to handle the situation. So anyway, the kernel module load figures all of that out. It's kind of like dynamic linking, but in addition it's linking both ways: it links certain function calls into the IO kernel code, but also links agreed-upon function calls back into the device driver code. So in particular, if you stop a device driver, or you want to stop using a device, it's critical to unload it properly and not simply pull the code out of memory, because the operating system would keep trying to call those functions. So anyway, it's definitely a more complicated process than conventional linking and running. All right, but what it does give you is a lot of flexibility. Certainly writing device drivers at kernel level is challenging, but it's become a more manageable enterprise than trying to build all of the potential device drivers into the kernel itself. So that's the architecture we have; we'll talk about it a little bit more after the break.

All right, so any questions so far? Yeah. No, the device driver has to be written by somebody, and you'll often download it from the manufacturer, although for Linux it may not live on the manufacturer's site; it would live in one of the Linux repositories, and hopefully get automatically loaded when you install the device. Yes, so that's a very good preamble to what we're going to talk about in a few slides. Certain types of device driver, like display card drivers, really have to be tightly linked with the kernel; that's just the problem. The good thing is that for a lot of, say, display drivers, there exist Linux drivers written by the manufacturer. It's more the smaller commodity devices where there may not be a Linux driver. And the good thing is that for USB devices on Linux, for instance, the driver API is actually in user mode: it's libusb, which allows you to write your device driver in user mode. It will only allow you to do certain things, and the kernel generally manages access to the devices, so it can be a little bit tricky to get access to your device. But once you do, there are really sort of two levels: there's a generic kernel USB driver, which is what your application code actually talks to, and your application code still needs to be very smart about what the device registers do, for instance, and how to talk to that device. But there's a bit of abstraction in USB that makes that a bit easier than it would be otherwise. They have this endpoint notion, which makes it easier to do that without, again, you having the ability to do much damage to the device. That seems like a good question, and I haven't heard of that, but I'll try to find out for you, because it seems like a good idea.

Okay, so let's just review what's happening, because there's quite a lot happening in the next week or so. Midterm one is happening on Monday. If your last name starts with A through L, you're in here, and everybody else is in VLSB. It's closed book, with two pages of handwritten notes allowed.
It covers everything up to Wednesday's lecture, and the review session is in 390 Hearst Mining. We're already into project two, so project two design docs are due on Thursday at midnight. Also this week we have the course survey online, at this link, where you can click through to the pages uploaded to the instructional server. So this is just the instructors' informal survey; we just want to get feedback on what's good or bad about the course. I'd personally like to get your feedback on what topics you would like covered, because operating systems are obviously a big area now, covering a lot of smaller devices as well as cloud and network devices. So it'd be great to get input on what things you think are important. Other than that, let's take a five-minute break.

Okay, let's continue. So, thinking about IO again: because we have to span an enormous range of data rates and message sizes, there's a granularity question we face. Devices like mice and keyboards output very small amounts of data, but you want it transmitted instantly, so it makes sense to output a byte at a time. When you have disks, they're moving data at about 100 megabytes a second or thereabouts, and you really want large blocks of data moved efficiently. In terms of access, some devices are most naturally thought of as sequential: again, mice and keyboards, and tape devices, always produce data in an order that matters. Other things like disks are random access, at some cost, but nevertheless randomly accessible, and CDs and SSDs are similarly random. The memory in a display controller is certainly random access. In terms of transfer mechanism, the two fundamental approaches are polling to check that a data operation has completed, versus relying on an interrupt from the device to tell you when that's happened. So interrupts have a lot of advantages in the sense that your process can just go away and do something else, and wait for the interrupt to come. On the other hand, if you're doing a lot of rapid interaction with the device, polling has the advantage that it doesn't incur the possibly significant context-switching cost of responding to an interrupt. So if you are doing a lot of transfers, very fast and regularly, polling is generally a better strategy; for more sporadic or unpredictable completion times, interrupts make the most sense.

So here's the graph showing the scales that we talked about already for different devices. So mice and keyboards are obviously just a few, well, actually up to a few hundred bytes a second, or a thousand bytes for a gaming mouse; everything like that is well under one megabit per second (the axis here is megabits per second). Hard disks are around a hundred megabytes a second now. The buses, interestingly, are not much faster than the disks, and SSDs are actually faster than some of these buses, and a little bit faster than gigabit Ethernet too. The serial ATA protocol has been growing faster than SCSI and the other protocols have, so that's now faster than the fastest single disks. And then we go up to more exotic kinds of interconnect, which include InfiniBand, which is what a lot of scientific computing people use; it provides about two orders of magnitude here over gigabit Ethernet for this version.
And that's critical for very high-connectivity scientific calculations, weather prediction and so on. Here's the theoretical x32 PCI bus bandwidth, again very high, and then some even more exotic interconnects, which I think sit on the processor itself. All right, so: a huge range.

We'd like to provide a few uniform interfaces despite the wide range of different devices. So for instance, if you remember, the PCI bus is sort of in the loop on all of these operations, including the very slow ones, so we want a very general, good interface for that. Another kind of standardization that Linux does, and all Unixes do, is to provide a file structure for every device. You can physically browse the device hierarchy and the bus hierarchy on Linux machines, and see a whole lot more than files: you'll actually see the device interconnect hierarchy, the PCI bus hierarchy and the USB bus hierarchy, and all of the interfaces and so on of all of the devices. So that creates this very nice uniform interface that allows you to use file-like primitives to open devices, read their properties, and read and write from them. And for devices where you don't literally use fopen, there are usually similar custom operators that do some kind of block read or block write. All right, and so we strive to build device drivers that provide an abstraction something like this for all of the devices.

Okay, and this is going to be a bit cursory, because it's hard to get into device drivers without all of the details. But I'll include some examples, and there's an optional reading link that I'll add to the course website so you can look at it if you want to. So typically the highest-level distinction that we want to make is between block devices, which move blocks of data using operations like read and write on arrays of bytes; they allow, for instance, efficient memory-mapped access, which is going to be very important soon. Character and byte devices typically do single characters at a time; they may support a line primitive, get-line operations, but otherwise they're very simple and close to raw character streams. And there is a sort of unique class of APIs related to network devices in Unix. They're different enough from block reads and writes to require their own interface: you want to distinguish the endpoints from the sites they're connected to, for instance. So the interface for that is a socket, which is a very general kind of primitive that's reproduced in Java and many other languages. And it can either run in a blocking mode, or you can have a kind of polling mode that uses a select primitive; if you know Java NIO, it's the same kind of primitive.

All right, so, corresponding to these different styles of transfer, there are different ways of dealing with timing. You can have blocking interfaces, which give you read and write operations that will just block until they complete. Non-blocking interfaces return with a, say, partially completed operation; in fact, many flavors of read will only partially complete, but come back to you quickly. You then have to call them again in order to get all of the data, but they will quickly flag if there's an error. Finally, we have asynchronous interfaces, which require some means of notifying you later that the operation has completed.
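Here's a small sketch of the non-blocking style using the standard Unix calls; the wrapper name is mine, but fcntl, O_NONBLOCK, and the EAGAIN convention are the real interface.

```c
/* Sketch: non-blocking read. Returns whatever is available right now;
   the caller is expected to loop, since partial reads are normal. */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

ssize_t read_some(int fd, char *buf, size_t want) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
    ssize_t got = read(fd, buf, want);
    if (got < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;  /* nothing ready yet; come back later, or wait in select() */
    return got;    /* bytes read, possibly fewer than requested, or -1 on error */
}
```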
In the asynchronous case, you somehow have to give the API function something that will let it complete the operation later. That might be a pointer to a buffer that it's subsequently going to fill, and then the notification can come either through an explicit signal or possibly by letting the user poll to see that the operation is complete. And sending is similar: you hand a pointer to a kernel function, invoke the function, and the function returns, but it hasn't completed the transfer yet. You'll have to make sure that while the transfer is in progress, you keep the data in that buffer unchanged, so that the transfer can complete safely.

All right. Okay, so far we've sort of looked at the general operations. The last distinction I wanted to make is between kernel-level and user-level IO. So we argued before that it's important to have kernel-level drivers for devices that interact very closely with the operating system, and for which failures would be catastrophic anyway. Those would include the display driver, and say the PCI bus driver. So there are relatively few sources for those kinds of display drivers; they have to integrate very tightly with the kernel, and most likely they're written directly by the device vendor. User-level drivers are quite popular for devices that are less dangerous, especially when there's a plethora of potential devices, more than you might expect the device vendors to support; device vendors are probably going to support Windows first before they do a Linux driver. And in that case, and this is the case for USB on Linux, there's a user-level driver API, which is libusb, and it's the main API for device drivers over USB. And it provides some nice higher-level primitives to the programmer, and avoids the need for the programmer to tweak individual registers and understand the hardware. Okay, so that allows USB devices to be programmed and used effectively by programmers who are not necessarily Linux kernel wizards.

Okay, another good feature of this, and we didn't talk about this, is that kernel-level drivers interact very tightly with the kernel itself, and especially on Linux, kernel APIs are not very stable. So when you want to install a true kernel-mode device driver, the preferred method is actually to compile it on the target platform, to make sure that every header file really matches what the driver expects. You avoid that problem with user-mode device drivers, because those only have to match the version of the library itself, and that's more stable than the kernel APIs.

All right. There are some differences in programming kernel versus user-level drivers. So I mentioned this before: this is sort of the sandbox that the kernel lives in. We want user code not to be able to change page tables and so on; similarly, we limit the types of routines that kernel code can access. So in particular, there's a very special version of malloc called kmalloc, which can't allocate blocks as large as malloc can, and the blocks also have to be physically contiguous by default in Linux. Now, well, let's see, I'll ask you actually: why would you want kernel memory allocation to be contiguous in physical memory? What's the advantage of that? There's one really important application in the device driver context. Well, we're going to get to it in a minute. What's one common mode for doing block IO with devices, other than through direct calls? What's the other common mode of rapidly doing block transfers?
So the other common mode, besides software explicitly moving a block, is direct memory access. For direct memory access, the device itself grabs hold of the bus and writes directly to memory, without the processor being involved. And it writes to physical memory, so it's a requirement that that block of physical memory be contiguous. So because large blocks of memory within the kernel are usually being allocated for IO, kmalloc does block allocation, and physically contiguous block allocation.

All right, kernel code should avoid blocking calls, for obvious reasons: there's no one else to schedule for you. And I'll just leave that last point, because we need to get to some other points. All right, user-level drivers, like ones built on libusb, are similar to other application programs, but they're going to be called often to process IO events, so they should work fast, ideally, or somehow postpone the work and return if there's more to do. Another nice feature of user-level device drivers is that you can use threads. In the kernel, it's almost never practical to somehow start a new thread; it would literally have to be a very different kind of thread. But with user-level device drivers you can use user-level threads and blocking operations, which are usually much simpler than the asynchronous operations, and they're not necessarily slower either. So in Java, for example, the blocking operations are part of the standard java.io library, and the asynchronous operations are part of java.nio, so you have those two different styles of programming at the application programming level. The blocking operations normally require you to use threads, so they're more complicated in some ways: if an operation is not complete, you need a thread to sit there and wait while the rest of your program continues. But the asynchronous versions are regarded as being quite a lot more complicated, and there are literally a lot more functions in the asynchronous version. All right, so at a high level, user-level drivers are very desirable: you have a lot of these higher-level features and simpler programming than programming at kernel level.

Okay, so, a quick summary of PCI bus evolution. Although we're not that interested in the hardware, an interesting aspect of the hardware is that it's changed a lot. It did start out life as a physical, that is parallel, bus with 32 bits for address and data. But as people noticed, there are a lot of limitations to physical buses. You have to multiplex the address and data for many requests, which slows things down; you have a queuing effect, since all of the IO operations have to wait on the same bus; and even worse, the slowest device must be able to track what's happening on the bus, which implies that the bus speed is normally set to the speed of the slowest device. So PCI Express, the latest version of the PCI interface, is actually not a parallel bus: it's a collection of serial channels, or lanes. Okay, so the lanes are the x part of the specification, as in x4, x8, x16. And the flexibility of the lane approach is that you can allocate more or fewer lanes to each specific device, and they can be dynamically shared. So slow devices are only going to consume one lane, perhaps even partially, while faster devices can consume a lot more and actually maximize their efficiency.
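To put rough numbers on a lane: PCIe version 3 signals at 8 gigatransfers per second per lane with 128b/130b encoding, so

$$\text{per lane: } 8\ \text{GT/s} \times \tfrac{128}{130} \approx 7.9\ \text{Gbit/s} \approx 0.98\ \text{GB/s}, \qquad \text{x16: } 16 \times 0.98 \approx 15.8\ \text{GB/s}$$

which matches the x16 figure coming up next.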
And the current bus speed for PCI Express in an x16 configuration, for the current standard, version 3, which devices are just starting to support, is about 15 gigabytes a second, which is pretty fast. For a lot of PCs, this would be similar to their memory-to-memory copy speed. And the forthcoming version 4 standard is twice as much again. So there's actually been quite good progress in bus speeds on computers, while everything else has stayed a lot more static.

All right, so here's a quick layout of a PCI bus system, with the CPU. As I mentioned, a lot of systems now have multiple buses to support all of the stuff that's hanging off the PCI bus: the display drivers, the other PCI slot devices, but also the other peripheral devices and the disk bus, SATA or whatever it is. Those all sit underneath the PCI bus, and to provide enough connectivity efficiently, there's normally a bridge.

All right, so that's all hardware-related, but between this picture of things and the previous picture, which was the true parallel bus, it does imply a radical change in the hardware. From a software point of view, though, the remarkable thing, and one of the successes of abstraction in the Linux system, is that people were able to migrate code from the PCI bus standard to PCI Express without changing their code, most of the time. What they did to make this work was standardize some of the control registers and communication interfaces within the PCI standard. So here are the standardized PCI control registers, and what you're seeing here is a four-byte base address register. It works basically like a channel: it's an area of memory, pointed to by a 32-bit pointer, which can be used for certain types of transfers. So this abstraction hides the fact that the communication is actually happening over serial lanes, and it allows the programmer to just think of the device as doing, let's say, direct memory access into this specific base address for one type of operation. This one is a separate memory channel; there are six of them altogether. Any specific device probably won't use all six, but they're available, and you could have, say, read operations here, write operations there, frame buffering or something else there. Anyway, so the Linux drivers did provide a nice abstraction layer that allowed the hardware to change completely under the hood without most of the code that uses it breaking.
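For what that looks like from a driver's point of view, here's a minimal sketch, using the real Linux helpers pci_enable_device and pci_ioremap_bar, of mapping the first of those six base address regions so the driver can touch it. Error handling is pared down for illustration.

```c
/* Sketch: map PCI base address region 0 of a device into kernel virtual
   memory. pdev would come from the driver's PCI probe callback. */
#include <linux/pci.h>

void __iomem *map_bar0(struct pci_dev *pdev) {
    if (pci_enable_device(pdev))       /* wake the device, enable its regions */
        return NULL;
    return pci_ioremap_bar(pdev, 0);   /* kernel mapping of BAR 0, or NULL */
}
```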
Okay, so let's quickly look at the communication from processor to device, and here's where we get into the details of direct memory access. The CPU interacts with a controller in the device. The controller typically has some control registers that specify the type of operation you want to do. You have some kind of physical bus connection into the device, through PCI or through USB, and also interrupts going back into the CPU, typically through the same bus that's carrying data. And the controller is responsible for processing the requests that you make, usually by writing into a register, and managing them, typically using some memory that's on the device. Say it's a disk device: it'll have some local RAM which will hold the results of reading off the disk, and that gets passed over the bus to the CPU. All right, so the registers are the interface through which the controller is controlled, and there's typically some memory to allow block transfers.

All right, and on the Intel architecture at least, there are two classes of instructions that are used for input-output. One of them is the IO instructions: on Intel architecture machines, there is actually a distinct IO address space, separate from the main memory address space, and to access it you need to use special IO instructions. These days, though, there's a preference for memory-mapped IO, which can use the regular memory load and store instructions, as long as the device's IO is mapped into main memory, and as long as the device itself knows how to transfer data into main memory. Was there a question? Yeah. Well, if you get an interrupt, that's one of two ways this can happen. Typically, a device is not initiating something; it's typically the CPU initiating something, about which it might then need to hear from the IO device regarding completion. An exception would be, say, a network packet arriving, and yes, in that case the path of notification would be through an interrupt to the CPU. Oh yeah, no, it still goes through the driver: the CPU will process the interrupt and fetch an interrupt vector, which should normally call a driver function.

All right, so here's a memory-mapped display controller. So the idea here is to map some of the registers and display memory into the normal physical memory space of the machine. In the old days this was done with hardware; now it's done by control registers within the video card. So the idea is that the display memory is a frame buffer of what appears on the screen. You can write into it directly when it's memory-mapped. So this address range here, is that the one? Right, yeah, this is the display memory right here: writing into that directly causes things to appear on the screen. In addition, the control register space is mapped to a memory area. Sorry, this one is a command queue: a larger command space where you would enter some description of, say, the triangles that you want to paint on the screen. So that's that memory up there, and the basic command and status memory spaces are down here. Those are typically what you're writing to cause any major operations to happen, such as a screen refresh, or rendering a specific scene. Yes, and address translation: the device side is writing directly into absolute physical memory addresses, but from the process's side, you can use the page tables to include protection bits. So for instance, if there are areas such as this one, which are read-only on the memory-mapped device, then you can set the protection bits for those pages so that you don't inadvertently write to them.

Okay, so device transfer, which we have discussed a little bit already. You can do direct IO using load and store instructions, which is easy to program, but consumes all of the resources of the processor to do the transfer. For any very large and fast accesses, it's better to allow the device itself to access the memory bus and write its own data; while it's doing that, the processor can be doing some other work.
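Before moving on to DMA, here's a tiny sketch of the load/store alternative: programmed IO against memory-mapped registers. The addresses and the bit layout are invented for illustration; real ones come from the device's documentation or its PCI base address registers.

```c
/* Sketch: programmed IO through memory-mapped device registers.
   Hypothetical addresses; volatile keeps the compiler from caching
   device state or optimizing the accesses away. */
#include <stdint.h>

#define CMD_REG    ((volatile uint32_t *)0xFE000000)  /* made-up command register */
#define STATUS_REG ((volatile uint32_t *)0xFE000004)  /* made-up status register  */

void issue_command(uint32_t cmd) {
    *CMD_REG = cmd;                     /* an ordinary store: a write to the device */
    while ((*STATUS_REG & 0x1) == 0)    /* poll the (made-up) ready bit */
        ;                               /* the CPU is fully occupied while we spin */
}
```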
Contrast that with direct memory access; here's an example with a disk controller. The device driver is told to transfer disk data, so the CPU initiates the transfer: the device driver tells the disk controller to transfer so many bytes of data to a particular memory address, giving it the hardware address. The disk controller starts the transfer and sends each byte to the DMA controller, which is a special controller here; in this case the DMA controller is being shared by all of the devices. The DMA controller is the thing that does the physical transfer, and finally it signals back to the CPU with an interrupt that it's finished.

All right, so notification can happen through interrupts, as we've just seen. The advantage of interrupts is that they handle unanticipated events well. The disadvantage is that they have a context-switch-like overhead: you've got to save the state of the processor and then resume it later, and that can dominate the cost of doing the transfers if you have a lot of transfers. Polling is where the OS checks one of the control or status registers within the IO device and decides whether it's finished or not. It has very low overhead in terms of instructions, but it does require the attention of the processor periodically, and the polling interval has to be small enough to make sure you don't miss or slow down the IO process. And it's not unusual to use both of them, especially in device drivers, and I'd guess also in network adapters, where there's a really large volume of transfers that have to happen very frequently, but with some amount of asynchrony.

So the last topic here: I was going to quickly go over USB, and since we're running out of time, I'll do it very quickly. USB is actually a tree-structured physical topology, starting from the host and going through a series of hubs. Interestingly, it's very simple from a signaling point of view: it's a master-only system, so for USB, somewhat surprisingly, the host always initiates things. If you want to wait on something happening, you have to explicitly create a process that polls. In spite of a relatively complex topology and a very complex low-level API, USB does have a nice high-level interface based on the idea of endpoints. Endpoints are simple senders and receivers of information that follow one of a relatively small number of communication protocols, something like network protocols, but much simpler. That high-level view is built into the libusb driver. So when you communicate with USB devices, all you have to do is figure out what configuration and what interface the device supports, and that basically tells you the communication protocol for talking to the endpoints. And if it follows one of the standard interfaces, then you can use standard built-in drivers to do that communication.

All right. Okay, so we're almost there; the very last topic is performance. And the last thing to be concerned about with IO performance is the propensity for queuing effects when you have IO events going through common interfaces such as USB buses, or even worse, serial and parallel buses. So you have some user code, probably several processes, which may be interacting with the same devices, let's say several different processes trying to interact with a disk. A basic result of queuing theory is that when you have a lot of requests approaching the theoretical capacity of whatever the device is, the queue tends to pile up.
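You can make that precise with the simplest single-queue model (M/M/1: arrival rate $\lambda$, service rate $\mu$, utilization $\rho = \lambda/\mu$), where the mean response time is

$$T = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}$$

so as $\rho \to 1$, that is, as load approaches capacity, response time blows up.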
To make that concrete: if you imagine you can process, say, 10 people an hour in some bureaucracy (the DMV is a good example, I guess, though it's far worse than that), then as soon as you get, let's say, seven or eight people per hour going into the same queue, you start to get the queue periodically filling up and draining, and you can see your response times go up very rapidly as you hit high utilization. All right, so what are some solutions? Anyone think of some solutions to this? All right, any ideas? All right, make them faster, sure. The better solution is to try to decouple things, because the worst problem in queuing is having many things going into the same queue. If you can make them independent, and allow events to go through different buses, or, as in the lane system, basically provide different channels that are independent, you decouple the queuing events. On tree-structured buses like USB, if you put more bandwidth at the root, you create basically more independent paths through the tree, even though it's still a tree. And finally, buffering and spooling, which are ways of giving the process a way of moving the data somewhere, hopefully closer to its endpoint, without having to wait for the device. All right, so we'll stop there. There's a quiz at the end, but we'll leave that till next time.