Sorry about the weather, it's nice outside. It's the time of year when it feels like school should be out. And it almost is. So we have today, which is Monday of the last week of the class. On Wednesday we're going to talk about James Mickens's paper on Atlantis, which applies some of these nice virtualization ideas inside the web browser. It's kind of a neat systems paper, since the web browser is, in fact, the new operating system. I discovered the other day, maybe you guys have noticed this already, that Chrome has a task manager. That's when I thought: OK, now the browser really is the new operating system. I don't know what you can actually do with it, but the fact that it has a task manager at all is a bad sign. But today we're going to talk about the other approach to virtualization. On Friday, we talked about full hardware virtualization, where the goal was to take an unmodified operating system and run it as a guest inside a host operating system, inside a virtual machine provided by a virtual machine monitor. Today we're going to talk about a different approach that has some different trade-offs.

All right, first some announcements. We have four new groups in the 100/100 column, as of at least a couple of hours ago, so now we have nine total. Harioa and Aaron Kuhn, are those guys here? Nope, I'm missing their big moment. The first undergraduates to finish assignment three, Allison Phillips and Dan Conway; at least I think they're undergraduates, good question. These names I'm not going to be able to pronounce: Harishankar. Yeah, how was that? All right. And Aniket, and Arundam, and Saptarshi. Are those guys here? No? All right, so congratulations, guys. That's four new groups. I'm really impressed; this is really cool, and there's still a lot of time to go.

Also on the horizon: this week we're building up to the exam, which is a week from today at 3:30, and on Friday I'll do some exam review. You guys have actually hit the first target as far as the course evaluations are concerned, so we're at 70%. And I will make sure to release a question as soon as I write one. I haven't started writing the exam yet, but when I do, the first thing I'll do is write a short-answer question and send it out on Piazza so you guys can see it. You're really close to the next milestone now; it's not that far to go to get to 80% and 90%. And I would really rather you guys do that, because it's fewer exam questions I have to write, which makes me happy. And given the midterm, you may want to see exam questions in advance. Just think about the midterm when you're deciding whether you want to see exam questions, and maybe find a friend in the class and make sure they've done their course evaluations.

So next week: the exam will be Monday, and then throughout the rest of the week we're going to focus on helping you guys finish assignment three. We'll have some special office hours, particularly on Friday and maybe a few other times during the week. I know it's exam week and you're busy with other stuff, but we'll try to give you as much help as possible so you can all get into the 100/100 club, or at least earn as good a grade on assignment three as you want. And we will give you a lot of support up to the May 15th deadline.
Okay, so "Xen and the Art of Virtualization." How would you categorize this paper? What kind of paper is this? Yeah, what's that? Yeah, so it is kind of a wrong-way paper, right? It's going to point out some of the flaws with something that we learned about on Friday. It's also a big-idea paper. So actually both of these are up here: it's a big-idea paper and a wrong-way paper. And of course, why would you look at papers that don't contain big ideas?

So let's start with the wrong-way part. What are they criticizing here? What's the design choice that the community was making at the time, related to virtualization, that they want to question? Ramya, do you want to continue your answer? Yeah: full virtualization. So we talked about full virtualization on Friday, and it sounded awesome for a while, particularly when we were talking about the theory and the idea of being able to trap and emulate privileged instructions. But this paper takes two issues with full virtualization. One of them is specific to the architecture you have to virtualize. Remember, this is the early 2000s (the paper is from 2003, if I remember correctly), and x86 is the dominant architecture. It still is for the types of systems we run virtualization on, although there's growing interest in virtualizing ARM architectures, because there are a lot more of them now and they tend to do better on mobile devices. Anyway, x86 is the target, and there are two categories of problems that this paper identifies with virtualization. One of them is specifically related to x86.

So they say support for full virtualization was never part of the x86 architectural design. When people set up the x86 instruction set, which we've been laboring to support for decades, no one really thought about this or designed it into the standard. Certain supervisor instructions must be handled by the virtual machine monitor for correct virtualization, but executing them with insufficient privilege fails silently rather than causing a convenient trap. Efficiently virtualizing the x86 MMU is also difficult. These problems can be solved, but only at the cost of increased complexity and reduced performance. So essentially, the x86 architecture cannot be virtualized simply using trap-and-emulate. Part of the issue is the instruction set itself and how certain instructions behave in privileged versus unprivileged mode. The MMU is also a source of concern, and we'll come back to talk about the MMU later.

However, they also make some other arguments against full hardware virtualization that don't necessarily have anything to do with x86. Here their claim is that in certain cases it's desirable for the guest operating system to actually see, and be able to manipulate, real hardware resources. They have a couple of examples. Providing both real and virtual time allows the guest to better support time-sensitive tasks. In a fully virtualized environment, things like timer interrupts and the clock are provided by the virtual machine monitor, so in certain cases the guest operating system doesn't have access to accurate timers that represent the passage of real, physical time. So they bring up TCP: they say that to correctly support TCP timers and RTT (round-trip time) estimates, the guest needs an accurate notion of real time. People who have taken courses on networking, why is this a problem?
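(A quick aside from me, not from the paper, as a hint: here is the classic Jacobson/Karels retransmission-timeout estimator in miniature, as a runnable C sketch. TCP feeds measured round-trip times into this; the gains of 1/8 and 1/4 are the standard textbook values.)

    #include <stdio.h>
    #include <math.h>

    static double srtt;    /* smoothed round-trip time estimate (seconds) */
    static double rttvar;  /* smoothed RTT variance */

    /* Feed one measured RTT sample; return the new retransmission timeout. */
    double update_rto(double sample)
    {
        if (srtt == 0.0) {                           /* first sample */
            srtt = sample;
            rttvar = sample / 2.0;
        } else {
            double err = sample - srtt;
            srtt += 0.125 * err;                     /* gain 1/8 */
            rttvar += 0.25 * (fabs(err) - rttvar);   /* gain 1/4 */
        }
        return srtt + 4.0 * rttvar;
    }

    int main(void)
    {
        /* A guest descheduled by the VMM for 200 ms "sees" an inflated
           sample, which corrupts the estimate for every later segment. */
        printf("rto = %.3f s\n", update_rto(0.030));
        printf("rto = %.3f s\n", update_rto(0.030 + 0.200));
        return 0;
    }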
Why could I have a problem if my guest operating system doesn't have a strong, physically anchored, sort of physically correct notion of time? Yeah: TCP and other networking protocols rely on timeouts on both ends to determine how to adjust performance and also to determine whether or not packets have been dropped. So for example, if I get a packet on a connection, the guest operating system, or the application or TCP stack running inside it, may have to deliver an acknowledgement within a certain amount of physical time. If my virtual machine monitor is providing an illusion of time that's not quite correct, or that floats a little with respect to actual physical time, or that jumps around because it's scheduling me in funny ways, it may be very hard to do this. So this is one of their concerns. And there were other reasons, which I'm not going to talk about because, to be honest, I don't fully understand them yet, but they also claim it would be desirable to expose real machine addresses: actually allowing guest OSes to manipulate hardware memory rather than virtual memory provided by the virtual machine monitor.

So these are the two arguments. The first argument is that x86 is hard to virtualize. The second argument is that even if we could virtualize it easily, there are cases where it's beneficial for the guest operating system to be able to access hardware resources. Please stop me if you have questions. I know I didn't do any review slides on full virtualization, but this is a good way to do some review as well.

So what's the big idea? If that's the problem with full virtualization, what do they offer as a solution? What do they propose? Full virtualization is hard, and maybe it's not even something I want to do. So instead, what am I going to do? What's the technique in the paper, the technique that was part of what I promised we would talk about in class? Yeah: paravirtualization. What paravirtualization says is: look, full virtualization is nice. It's fancy, it's shiny, it's cool. It's a nice trick that you can play on this unsuspecting operating system. You're going to run it inside a virtual machine without telling it anything, and it's never going to know. But it's really hard to get right, and maybe we don't want to do it anyway. Remember all the work VMware has to go through to get this to work on x86? You have these shadow page tables, which we haven't talked about, which are a nightmare. You have to rewrite certain instructions at runtime, actually doing binary translation. It just gets to be a mess. So let's take a different design approach. Let's trade some small, minimal changes to the guest operating system in exchange for big improvements in performance and in the simplicity of the resulting virtual machine monitor, which they have a different name for. So again: with full virtualization, I do all this work in order to avoid changing the guest operating system. Their argument is: look, it's just not that much work to change the guest a little bit. And if I make a few small changes here and there to allow the guest to better support virtualization, on x86 in particular, then it has these huge benefits.
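(To make "small changes to the guest" concrete, here's a sketch of the flavor of such a change, compiled down to stubs so it runs anywhere. hypervisor_mmu_update is loosely modeled on Xen's real mmu_update hypercall, but the signature and both stub bodies are invented for this sketch.)

    #include <stdio.h>

    /* Stand-in for the privileged x86 instruction that loads the page
       table base register (mov %cr3). In ring 0 this works; run it with
       less privilege and, on x86, it misbehaves instead of trapping. */
    static void write_cr3(unsigned long pt_base) { printf("cr3 <- 0x%lx\n", pt_base); }

    /* Stand-in for a trap into the hypervisor, which validates the
       request and performs the privileged operation on the guest's behalf. */
    static void hypervisor_mmu_update(unsigned long req) { printf("hypercall: 0x%lx\n", req); }

    /* Bare-metal kernel: just executes the privileged instruction. */
    void switch_page_table_native(unsigned long pt) { write_cr3(pt); }

    /* Paravirtualized (XenoLinux-style) kernel: the same spot in the
       code, but the privileged instruction has been replaced with an
       explicit request to the hypervisor. This is the kind of small,
       local change the paper is counting. */
    void switch_page_table_paravirt(unsigned long pt) { hypervisor_mmu_update(pt); }

    int main(void)
    {
        switch_page_table_native(0x1000);
        switch_page_table_paravirt(0x2000);
        return 0;
    }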
Okay, so they say: we avoid the drawbacks of full virtualization by presenting a virtual machine abstraction. So remember, the virtual machine monitor, which in their case they call the hypervisor, is in charge of providing a virtual machine abstraction. Before, that virtual machine abstraction had to look identical to the physical machine; that was the goal, and that's what allowed me to run an unmodified operating system. Here they're saying they're going to provide an abstraction that is similar but not identical to the underlying hardware, an approach which has been dubbed paravirtualization. This promises improved performance, although it does require modification to the guest operating system.

Now, this is also a really important point: one of their design principles is that they do not require changes to the programs that run on the system. Paravirtualization requires that I change the operating system; it does not change what they refer to as the application binary interface, the way that applications running on the system communicate with the operating system. So the changes have to be confined to the OS. Why? Why is this such an important design goal for Xen? Yeah: I want to be able to take a binary that's been compiled against a version of Linux and just run it. In some cases (I mean, open source is great and all) I don't even have the sources for that binary. So I may not be able to recompile it even if I wanted to, and I don't want to, because recompiling everything on my machine just to move into this virtual machine would be a huge pain in the neck. People who use Linux-type machines: we've all gotten very spoiled by these binary distribution tools. I have no idea how much time it would take my system to rebuild every software package on it from source; I suspect it would be counted in hours of just compiling and compiling and compiling. When you install something using a tool like apt-get, it is in most cases not compiling anything. It's pulling a binary from some server, plopping it onto your system, and making sure the right libraries are installed. That's why it's so fast; if it were compiling, it would be slow. So I don't want to change the ABI.

So it's kind of interesting. With full virtualization, I wrote this complex virtual machine monitor to allow me to avoid changing the operating system, and of course, by extension, any of the user applications. Here I'm moving one level up and saying: okay, I'm going to make a few changes to the OS, but I still don't want to change the applications. That's a really important design point, because I think if they had said, "oh, it just requires one or two changes to your application," that would have killed the whole idea. It's not possible to do that.

Okay, so let's talk about their four design principles. The first is the one we just pointed out: I cannot change the application binary interface, and by this they really mean the kernel interface that applications use. The way that applications communicate with the Linux kernel running as a guest OS on Xen has to be identical to the way they would do it normally. The kernel itself is different; the kernel interface is identical.
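(Here's what "the ABI doesn't change" means concretely. This is my example, not the paper's, but the convention it shows is the standard 32-bit Linux one: system call number in EAX, arguments in EBX/ECX/EDX, then int $0x80. A binary built this way runs unmodified on XenoLinux, because the guest kernel still accepts exactly this interface; only the kernel's implementation underneath changed. Builds with gcc on 32-bit x86 Linux.)

    /* write(1, ...) via the raw 32-bit Linux system call convention. */
    static long sys_write(int fd, const void *buf, unsigned long len)
    {
        long ret;
        __asm__ volatile ("int $0x80"
                          : "=a"(ret)
                          : "a"(4 /* __NR_write on i386 */),
                            "b"(fd), "c"(buf), "d"(len)
                          : "memory");
        return ret;
    }

    int main(void)
    {
        sys_write(1, "hello via int 0x80\n", 19);
        return 0;
    }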
Second: they want to support full multi-application operating systems, as this allows complex server configurations to be virtualized within a single guest OS instance. And essentially they're also arguing that they want to support a bunch of different OSes. If they could only get this to work with, say, Linux, it wouldn't necessarily be such a great approach. But it turns out they can get this to work with a bunch of different operating systems without making huge numbers of changes.

Third, they claim that paravirtualization is necessary to obtain high performance and strong resource isolation on (I love this; this emphasis here is all mine) "uncooperative machine architectures" like the x86. The x86, from the perspective of a system that's trying to perform virtualization, is uncooperative. It does things I don't like. It behaves in weird ways that I don't expect. And so in order to tame it, they're arguing that paravirtualization is necessary to get good performance.

And finally: even on cooperative machine architectures, completely hiding the effects of resource virtualization from guest operating systems risks both correctness and performance. Correctness because of the timing issues that we just talked about, and performance in other cases, because there's a level of translation going on that Xen can avoid.

Okay, so Xen introduces the idea of what's called a hypervisor. The hypervisor in a lot of ways is similar to the virtual machine monitor, except that on some level what they've done is split the functionality provided by a typical virtual machine monitor into two pieces. So down here (I don't know if you guys can see this) is the Xen hypervisor. Here's my hardware; right above the hardware is the Xen hypervisor. And here are some of the things it provides: a virtual CPU, virtual memory, a virtual network, and a virtual block device, or disk. These are the components of the virtual machine that Xen makes available to guest operating systems. Now, the hypervisor here is quite small. In fact, they talk in the paper about the fact that they can load the hypervisor into the top 64 megabytes of the address spaces of all programs running on a Xen server. So the whole thing fits within 64 megabytes. It's pretty compact.

One thing that's interesting in the OS community: when people started to work on hypervisors and systems like this, to some degree I think there was a certain type of OS person who thought, this is fantastic. Because, remember, operating systems people tend to like to write new operating systems? The hypervisor is kind of like a new operating system, except it's more fun, because it doesn't get big and bloated. You don't have to support all these ugly devices made by people who don't know how to program. You get to write this really, really critical, tiny little piece of code that's going to be clean and have these great interfaces. So it's kind of fun: a clean-sheet design of the sort you can't get away with anymore if you want to actually run on bare metal.

Okay, so I've got Xen. But then what you see here is the control plane software. When you use something like VirtualBox, there are all these tools required to do things like launch new virtual machines, reconfigure the hardware for a particular virtual machine, load stuff into a particular slice of the machine, whatever. In Xen, all that stuff can be moved into one of the guests. So there's a guest OS that's loaded that runs all of this control plane software, and it uses a special interface provided by the Xen hypervisor. The Xen hypervisor itself really tries to provide as little functionality as possible.
All the features like the GUIs and the fancy command-line interfaces you need to reconfigure the system can be refactored into this piece of user-space code that runs up here. So if you wanted to reconfigure a Xen machine, you would log into and access this server, and it looks like just another computer. It's the one guest operating system that is allowed to use the domain zero control interface to do things like start other guest operating systems, reconfigure them, allocate memory for them, whatever. So this tool allows me to control the rest of the system. Over here I've got my vision of a couple of different guest operating systems. And of course, because I'm a computer scientist, I've got Linux, BSD, and XP, just for the sake of it. BSD had to be in there; that was really important.

Okay, so do people understand how this works? The other thing that's really important to note here is that the Xen approach requires this hypervisor to be slid underneath everything. So unlike something like VirtualBox, you can't just download Xen and run it on your machine. It requires that you set up a machine that's ready to run it, because it has to sit at the bottom, underlying everything. Even this control plane software is not controlling the physical hardware directly; it's still going through Xen. So the hypervisor has to be below all of the operating systems. And all of the operating systems that use it are, as you see, XenoLinux, XenoBSD, XenoXP: modified operating systems to which we're going to apply some of the techniques of paravirtualization, which we'll discuss in a minute, in order to get them to run. So this is a fundamentally different approach, and in a certain way a lot more conducive and appropriate to server-class environments. You could, if you wanted to, probably set up a laptop like this, with Xen in ring zero and everything else running as a guest, but you probably wouldn't. You'd probably set up a server like this, though, and use it to run your virtual machines.

Okay, so I'll just throw this up here. This is from the paper: the summary of the changes Xen requires. When you're reading this paper and thinking about this idea, one of the things Xen has to convince you of is that this is easy. If they had said, "we had to completely rewrite Linux from scratch, it took us two years, and we ended up modifying 30% of the kernel," you would say no way. Because remember, essentially, the Linux community, in order to use Xen, now has to maintain a version of Linux that works with Xen and a separate version of Linux that runs on bare metal. So in order to use paravirtualization, I'm forking my development; I have to support two separate targets. Xen, or paravirtualization approaches generally, is now seen as a build target for Linux: I can build a version of Linux that's designed to run on Xen, but there are changes between that and the versions that run on bare metal. If those changes were huge and sprawling, no one would want to do this and maintain it. But it turns out they're not. So let's just look at a couple, starting with protection of the CPU, where they've modified what we talked about on Friday a bit.
So on Friday, what we said is: in order to allow virtualization to work safely, the guest operating system has to be deprivileged. I have to take it and run it at a user privilege level. It turns out that Xen makes a different claim. They say they can make this safe as long as the guest operating system runs at a lower privilege level than the Xen hypervisor. And you might think, well, there are only two privilege levels. But it turns out, if you remember from Friday, x86 has more than two. The people who worked on this must have thought: at least there's one good thing about x86. This feature was built in, nobody used it for decades, and it was just sitting around waiting for when we needed it. The rest of x86 is a huge pain, but this actually helps them out.

Next, exceptions: the guest OS is allowed to register a descriptor table for exception handlers with Xen. So these are some of the features the Xen hypervisor has to provide to guest operating systems, and the changes required for guest operating systems to run. Normally the guest operating system would install its exception handlers in some hardware-specific way. For example, on your systems, when it boots, your OS/161 kernel tells the bootloader to load a specific piece of code at the address that the MIPS processor is going to trap to when an exception occurs. On Xen, it's different: I have to register the exception handlers I want to use with the hypervisor, at boot time. I have to tell the hypervisor: when an exception happens, jump to this piece of code. Now, the nice thing about this is that it lets me avoid what happened with the virtual machine monitor doing full virtualization, where those exceptions were going to end up being handled by the host first. This allows Xen to vector exceptions directly into the guest operating system, which gets me a nice performance improvement. For system calls, we'll get back to those.

Then there's time. They say each guest OS has a timer interface, and Xen makes it aware of both real and virtual time. Real time is wall-clock time: how long passes between two events as observed by an outsider. Virtual time, on the other hand, is time that only moves when the guest operating system is running. Remember, in the later part of their evaluation they have up to 128 guest operating systems running on one server, and there are not 128 cores on that server. So I am context switching: one of the things a hypervisor is doing is context switching between those guest operating systems, in a way that's very analogous to the way we context switch between processes in an operating system. The difference, of course, is that a typical operating system expects to be in full control of the hardware. It doesn't expect to lose the processor; it expects to be the one giving out the processor. That's why the full virtualization approach ran into trouble: the guest operating system was not aware that it's now just an application, and so it may not see all of time pass. Here, Xen makes sure the guest is aware both of wall-clock time, which can tick even when it isn't running, and virtual time, which only ticks when it is. Does that make sense, that distinction? Okay. All right.
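(A small sketch of the real-versus-virtual-time distinction. The structure here is hypothetical; real Xen exports time through a shared-info page with a more involved protocol. The two quantities are the point.)

    #include <stdio.h>
    #include <stdint.h>

    struct guest_time_info {
        uint64_t wall_clock_ns;    /* real time: advances even while this
                                      guest is descheduled */
        uint64_t virtual_time_ns;  /* advances only while this guest is
                                      actually running on a CPU */
    };

    static void report(const struct guest_time_info *t)
    {
        printf("wall = %llu ns, virtual = %llu ns, descheduled = %llu ns\n",
               (unsigned long long)t->wall_clock_ns,
               (unsigned long long)t->virtual_time_ns,
               (unsigned long long)(t->wall_clock_ns - t->virtual_time_ns));
    }

    int main(void)
    {
        /* Example: 500 ms of wall-clock time passed, but we only ran for
           200 ms of it; other guests had the CPU for the rest. Virtual
           time is what the guest's own process scheduler wants; wall-clock
           time is what TCP timers need. */
        struct guest_time_info t = { 500000000ULL, 200000000ULL };
        report(&t);
        return 0;
    }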
So let's talk about two of the subsystems that Xen virtualizes, and how the Xen virtual machine interfaces with the guest operating system for each. They acknowledge that virtualizing memory is hard on x86. Well, actually, sorry: virtualizing memory is hard in general. It's easier if the architecture has one or both of two features.

The first feature is a software-managed TLB, which can be efficiently virtualized. Why is this? Can you guys think of a way to do this? We know x86 does not have a software-managed TLB, but what's one approach to efficiently virtualizing a software-managed TLB? Yeah, what's that? Right, so that's essentially what we do. One way to virtualize the TLB (and they refer to a paper about this) is to have the virtual machine monitor set up a fake TLB: just an area of memory where the guest operating system is allowed to store virtual-to-physical translations. Because then, when there's a fault (remember, in this case the hypervisor sees all of the TLB misses), it can pull from that memory region. So what happens is: a particular guest OS is running, there's a TLB miss, that miss gets handled by the hypervisor, and the hypervisor looks up the translation in this memory region. In fact, this way I can even allow the guest operating system to use a much larger TLB than is actually available. I can say, okay, I'm going to allow you a fake TLB that's twice as large as the real one, because it's just stored in memory. So this is actually pretty easy to do.

The other thing that helps here, particularly when I'm running multiple guest operating systems (which is what I'm doing when I'm using paravirtualization), is if my TLB supports some kind of address space identifier. Address space identifiers, like the ones on MIPS, may have been introduced to make operating systems more efficient, but what they allow me to do is avoid completely flushing the TLB every time I switch between guest operating systems. If I don't have this support, flushing is the only thing I can do. Remember, you guys are already doing that when you switch between processes; now I'm switching between entire guest operating systems, each of which is switching between its own processes as well, and this is a huge source of overhead. Having address space identifiers allows me to get away with not doing it.

Now, of course, the x86 has neither of these features. Too bad, right? On x86, when a TLB miss occurs, the hardware requires you to set up page tables and it walks them by itself. So unfortunately the hypervisor doesn't see TLB misses, and there certainly aren't address space identifiers, because there isn't a software-managed TLB at all.
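(Here's a minimal sketch of the fake-TLB idea from a moment ago: the MIPS-style, software-managed-TLB case, not x86. All the names here are invented for illustration.)

    #include <stdio.h>
    #include <stdint.h>

    #define FAKE_TLB_SLOTS 128        /* can be larger than the real TLB */

    struct fake_tlb_entry {
        uint32_t vpage;   /* guest virtual page number */
        uint32_t ppage;   /* machine page number */
        int      valid;
    };

    /* Guest-maintained memory region the hypervisor knows how to find. */
    static struct fake_tlb_entry fake_tlb[FAKE_TLB_SLOTS];

    /* Hypervisor's miss handler: look up the faulting page in the guest's
       fake TLB and (conceptually) load the translation into the real one. */
    static int handle_tlb_miss(uint32_t vpage, uint32_t *ppage_out)
    {
        for (int i = 0; i < FAKE_TLB_SLOTS; i++) {
            if (fake_tlb[i].valid && fake_tlb[i].vpage == vpage) {
                *ppage_out = fake_tlb[i].ppage;  /* real TLB write goes here */
                return 0;
            }
        }
        return -1;   /* no translation: reflect a page fault to the guest */
    }

    int main(void)
    {
        fake_tlb[0] = (struct fake_tlb_entry){ .vpage = 0x42, .ppage = 0x99, .valid = 1 };
        uint32_t p = 0;
        printf("miss on 0x42: %s, p = 0x%x\n",
               handle_tlb_miss(0x42, &p) == 0 ? "hit" : "fault", p);
        return 0;
    }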
So here's how Xen responds to these problems. They do two things. The first is that they allow the guest operating system to be responsible for allocating and managing the hardware page tables. This is quite a big difference. We didn't talk about this with virtual machine monitors, but on a virtual machine monitor I can't allow this to happen, because the underlying operating system hasn't been modified and could potentially take control of the entire machine, and I don't want that to happen. Here, remember, I'm allowed to modify the guest operating system, and they're going to use that capability to allow the guest to manage the hardware page tables safely. We'll talk about how they do that in a second.

The second thing is where they load Xen itself. Remember, I need to be able to communicate with the hypervisor: exceptions are going to go to the hypervisor, and the hypervisor has an interface that guest operating systems use to manage the resources they have available as part of their virtual machine, doing things like, for example, changing page table entries. So Xen exists in a 64-megabyte section at the top of every address space. What this means is that the TLB entries for Xen are always loaded, and they stay loaded regardless of whether I'm running the guest operating system or an application inside the guest operating system.

So what's really interesting about this? They say they can get away with this and it does not break applications. Why not? This comes back to something we talked about a long time ago, when we talked about virtual memory. Now, the guest operating system I can modify, right? So I can make space for Xen at the top of its address space. But all of the applications that run inside the guest operating system also have this hypervisor snuck into the top 64 megabytes of their address space. Why doesn't this cause problems? Remember, I'm not modifying them. Yeah: because that part of the address space is unused. It's really kind of funny that they get away with doing this. It's a neat observation about most applications that this part of the address space is unused, and unused so categorically that Xen can get away with hijacking it to allow its code to be present all the time. And what this means is that for transitions in and out of the hypervisor, there is no need to do a TLB flush, because those TLB entries are completely distinct from the TLB entries used by everything else on the system, both the guest operating system and the applications. Do you guys understand how this works? Yeah, it's very clever. Okay.
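(The arithmetic, for the record: the top 64 MB of a 32-bit address space is the range starting at 0xFC000000. That constant matches my memory of 32-bit Xen's layout, but treat it as illustrative.)

    #include <stdio.h>

    #define ADDR_SPACE_TOP   0x100000000ULL    /* 4 GB */
    #define HYPERVISOR_SIZE  (64ULL << 20)     /* 64 MB */
    #define HYPERVISOR_START (ADDR_SPACE_TOP - HYPERVISOR_SIZE)

    int main(void)
    {
        /* Prints [0xfc000000, 0x100000000): the slice every address
           space, guest kernel or application, leaves for the hypervisor,
           so its TLB entries never need to be flushed. */
        printf("hypervisor occupies [0x%llx, 0x%llx)\n",
               HYPERVISOR_START, ADDR_SPACE_TOP);
        return 0;
    }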
Okay, but then you might say: wait, I just said I'm going to allow the guest operating system to modify the hardware page tables. How do I make this safe? And they have a really clever technique for this. I'll just read it and then try to explain it. Each time the guest operating system requires a new page table, perhaps because a new process is being created (that's when we create page tables, right?), it allocates and initializes a page from its own memory reservation and registers it with Xen. At this point, the OS must relinquish direct write privileges to the page-table memory. All subsequent updates must be validated by Xen.

So this is pretty cool. The page tables are a data structure that exists in the guest operating system's memory. Xen is going to allow the guest operating system to configure the processor so that at runtime, when faults occur while the guest operating system or one of its applications is running, those page tables are used directly by the hardware. The way they make this safe is they tell the guest operating system: look, you can use any of your memory you want for your page tables. I am not going to dictate how you set up your page tables. I mean, the hardware dictates how the page tables are structured, to a large degree, but where they go in your memory, that's your problem. However, the one thing I require is that as soon as you register a page of memory as part of your page tables, you only have read-only access to that page. You cannot modify it without talking to me.

So it essentially allows the guest operating system to set up its page table data structures as usual, with the caveat that all the pages included in them are marked read-only to the guest. What Xen then does is provide an interface allowing the guest operating system to update its own page tables. So let's say I want to update a page table because I've added a page to a process: it's requested some more heap, or I'm swapping a page out to disk. What am I going to do? I'm going to access that page table data structure, and (this is what's brilliant) it's read-only. So I can access it, I can walk it, I can do a lot of things with it safely, without any problem, just the way I normally would. Remember, I want to minimize how much I modify the guest. However, when I get to the point where I want to change a page table entry, or add a new part of the page table data structure, then I need to communicate with Xen. Rather than just overwriting the page table entry to say this page is now on disk, or this page exists, or whatever, Xen provides an interface that the kernel has to use to do this. That interface traps into the hypervisor, which has access to the page tables, and the hypervisor can then validate that the update is safe. Essentially, it can validate that the update does not cause the guest operating system's page tables to point to memory outside of the memory that Xen gave the guest. Do you guys understand this? I'm happy to explain it in more detail. It's very clever. Yeah, so the question is how the Xen hypervisor switches privilege levels? That's a good question; let me come back to that. Any other questions about this, though?

So: the operating system maintains its page tables as usual, except that the page tables are all read-only, and there is a new interface, a special system-call-like routine, that the guest operating system has to use to ask Xen to change its page tables on its behalf. The other thing the guest operating system can do is batch update requests to the page tables. Let's say I'm creating an address space for a new process. I've registered the pages, but then I'm loading things from the ELF file, for example, and there are all these changes happening at once. (You guys may have done things like this when you implemented stuff like as_define_region.) I'm allowed to queue up those updates and then submit them all at once, and what that allows me to do is avoid many of the transitions in and out of the hypervisor that would otherwise be required.

And again: the top 64 megabytes of each address space is not accessible or remappable by guest OSes. This address region is not used by any of the common x86 ABIs, however, so this restriction does not break application compatibility. Of course, if I want to, I can write a funky linker that loads my code starting at null. And in fact, I've committed to doing this this summer, so I can find all the places where you guys are doing silly things like checking for null. Is this pointer null? Is that pointer null? Who cares? It could be null. Next year maybe it will be null, and we'll see how many kernels break. But in general, common x86 ABIs don't load code at addresses like these.
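(Pulling the page-table mechanism together, here's a minimal hypervisor-side sketch of validating one update. The structures and names are invented for illustration; real Xen's update interface is more involved, and real guests batch many of these requests into a single trap, as we just said.)

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct domain {
        uint64_t first_frame, last_frame;  /* machine frames this guest owns */
    };

    struct mmu_update_req {
        uint64_t *pte_slot;   /* slot in the pinned, read-only page table */
        uint64_t  new_pte;    /* proposed new entry */
    };

    #define PTE_PRESENT 0x1ULL
    #define PTE_FRAME(pte) ((pte) >> 12)   /* frame number, 4 KB pages */

    /* Hypervisor side: apply one update only if it is safe for this domain. */
    static bool do_mmu_update(struct domain *d, struct mmu_update_req *req)
    {
        if (req->new_pte & PTE_PRESENT) {
            uint64_t frame = PTE_FRAME(req->new_pte);
            if (frame < d->first_frame || frame > d->last_frame)
                return false;  /* points outside the domain's memory: reject */
        }
        /* Not-present entries ("this page is on disk") carry guest-private
           bookkeeping the hypervisor doesn't need to interpret. */
        *req->pte_slot = req->new_pte;
        return true;
    }

    int main(void)
    {
        struct domain d = { .first_frame = 0x1000, .last_frame = 0x1fff };
        uint64_t slot = 0;
        struct mmu_update_req ok  = { &slot, (0x1234ULL << 12) | PTE_PRESENT };
        struct mmu_update_req bad = { &slot, (0x9999ULL << 12) | PTE_PRESENT };
        printf("in-range update:     %s\n", do_mmu_update(&d, &ok)  ? "applied" : "rejected");
        printf("out-of-range update: %s\n", do_mmu_update(&d, &bad) ? "applied" : "rejected");
        return 0;
    }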
Okay, so now let's talk about the CPU interface, and let's get to Yusuf's question. Principally, this idea that I want to put a hypervisor in between the OS and the hardware violates the assumption that the operating system is the most privileged entity in the system. In order to protect the hypervisor from OS misbehavior, and to allow domains to be isolated from each other, guest operating systems must be modified to run at a lower privilege level. And yes, finally x86 has a feature that we like. Remember, x86 has this feature called privilege rings. It is not commonly used; they point out that they're the first OS to use it since OS/2, years and years ago. Efficient virtualization of privilege levels is possible on x86 because it supports four distinct privilege levels in hardware. This is part of the x86 standard. They're numbered from zero, which is most privileged, to three, least privileged. OS code typically executes in ring zero, because no other ring can execute privileged instructions, while ring three is generally used for application code. To their knowledge, rings one and two had not been used by any well-known x86 OS since OS/2.

The idea with privilege rings: we talked before about there being a way for an application on x86 to trap into the kernel, essentially to activate a mechanism that both increases the privilege level and causes a certain piece of code to run. On Xen (and I won't get into too many details) they utilize the fact that x86 actually has a mechanism for moving between these privilege rings. If I'm running in ring one, I can trap into ring zero, just like if I'm running in ring three I can trap into ring zero. So what they do is take the operating system (remember, we're modifying it here) and modify it so it will safely run in ring one, and they load the hypervisor to run at ring zero. The hypervisor is more privileged, and if the guest operating system does something bad, it will activate the hypervisor, and the hypervisor gains control of the machine in very much the same way that an operating system gains control of the machine when a process does something wrong. The hypervisor can then choose to kill off the faulty guest OS. However, there's also a mechanism for the guest operating system to trap intentionally into the hypervisor, for example, when it wants to change its page tables. So there's both protection and an intentional interface. And this, by the way, is the origin of the term domain zero, or dom0, which you hear a lot if you use a Xen system: domain zero is the privileged domain where the control plane software runs.
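(Here's what trapping from ring one into ring zero might look like from the guest kernel's side: the same software-interrupt mechanism applications use to enter a kernel. If I remember right, early 32-bit Xen used interrupt vector 0x82 for hypercalls, with the hypercall number in EAX, but treat the details as illustrative. This compiles on 32-bit x86 with gcc, but it only does something useful inside an actual Xen guest.)

    /* Hypercall number in EAX, one argument in EBX, then a software
       interrupt that the hypervisor owns. */
    static inline long hypercall1(int nr, long arg)
    {
        long ret;
        __asm__ volatile ("int $0x82"         /* trap: ring 1 -> ring 0 */
                          : "=a"(ret)
                          : "a"(nr), "b"(arg)
                          : "memory");
        return ret;
    }

Compare that with the application's view: an application in ring three still reaches the guest kernel via int $0x80, exactly as on bare metal. That's the unchanged ABI from earlier.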
So, here's a little bit of review. They talk about trying to improve the performance of the common exceptions that are generated while the guest operating system, or applications running inside it, are running. What are the two exceptions I probably worry about happening often enough to create a potential performance bottleneck? Yeah, okay, so this is x86, so page faults are one of them. What's the other one? System calls, yeah. Page faults and system calls. Typically, only two types of exception occur frequently enough to affect system performance: system calls, which are usually implemented via a software exception, and page faults.

Now, for system calls they have a really cool solution: they allow the guest operating system to register what's called a fast-path system call exception handler. What that means is that when that guest operating system is running, system calls will not be handled by Xen; the hypervisor won't get involved. Remember, with full virtualization there was nothing I could do about this: the VMM was going to see every system call, because every system call was going to end up at the host, and the host had to push it back out to the VMM to be handled. On Xen, because I'm modifying the guest, one of the things the guest will do is register the location of its system call handler with Xen. Xen then does some checking to make sure that it's safe to execute that code, and assuming it is, Xen configures the machine, when that guest is running, to use the guest's fast-path system call handler. So system calls do not go through Xen; they go directly into the guest operating system. This is one of those places where I've traded a small modification to the guest operating system for a big performance win: system calls can trap straight into the guest operating system while it's running. This is cool. Unfortunately, they cannot apply the same technique to page faults. Page faults have to go through Xen; there's no way to vector them directly to the guest.
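(A sketch of the fast-path idea, with all the names invented for illustration. The reason page faults can't get the same treatment, per the paper, is that the faulting address is delivered in a register, CR2, that only ring zero can read, so Xen has to be involved to read it and pass it along.)

    #include <stdio.h>

    typedef void (*handler_t)(void);

    static handler_t hw_trap_table[256];   /* stand-in for the hardware IDT */

    static void guest_syscall_entry(void) { printf("guest kernel handles the syscall\n"); }

    /* Hypervisor: validate and install the guest's fast-path handler. */
    static int hv_set_fast_trap(int vector, handler_t h)
    {
        if (vector != 0x80)
            return -1;   /* only the system call vector is eligible */
        /* Real Xen also validates the handler (for example, that it does
           not ask to run in ring 0) before installing it. */
        hw_trap_table[vector] = h;
        return 0;
    }

    int main(void)
    {
        hv_set_fast_trap(0x80, guest_syscall_entry);
        hw_trap_table[0x80]();   /* an app's int $0x80 now lands here directly */
        return 0;
    }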
OK, so we're pretty much out of time, actually. Do you guys have any questions about this? I've tried to hit the highlights here; you'll find the rest in the paper. There are two things this paper needs to convince you of. What are those two things? What's one of them? What's that? It works. OK, can we do better than "it works"? Yeah: it's fast. So they have a bunch of performance results showing that it's fast, showing that the paravirtualization approach can reduce the overhead of the various transitions involved in virtualization. So that's great. What's the other thing they have to convince you of? What's that? It's easy. I like that; that's a good way of putting it. There weren't that many changes they had to make, and to show that, they have this one little figure with the number of lines of code they had to modify in Linux, and it's small. And those two things taken together have caused this to really become the dominant form of virtualization used for hardware, particularly for servers, today.

Yeah, question? Right, so it's not quite as cool as running an unmodified guest. But the nice thing about x86, remember, is that the x86 architecture specifies the structure of the page tables and the page table entries. So this is a great point, and it's kind of interesting: to some degree, x86 helps me out here. The question is: don't I need to know something specific about the structure of the guest's page table entries in order to validate that updates to the page table data structures are safe? The answer is yes, but it turns out that because x86 uses a hardware-managed TLB, x86 specifies the format of those entries. Now, Xen may not be able to determine what certain parts of a page table entry mean. For example, Linux and Windows may use different bits of a page table entry, bits Xen knows nothing about, to indicate where a page is on disk when it's been swapped out. But Xen doesn't care about that. What Xen cares about is: is this page in memory? And the in-memory format is identical, because the hardware has to understand it. So the hypervisor can essentially play the role of the hardware: when you give me a new page table entry, I look at it the same way the processor would. If it indicates that the page is in memory, I look at the physical address it maps to, and I can very quickly check that it's within the range of physical addresses on the machine that I've given you permission to use. Does that make sense? It's a great question.

And this is a really, really cool question too. So, to be honest, I have no idea exactly what the differences are between rings one, two, and three; I think it has to do with their view of memory. The important difference between ring one and ring zero is that ring one is not allowed to execute privileged instructions. Those can only be executed in ring zero. So in order to use Xen, what they needed to do (and they talk about this) is modify the operating systems so they can safely run in ring one. When the OS wants to do something that requires ring-zero privilege, things like modifying special registers that control how things are interpreted, it has to contact the hypervisor. Any privileged instructions have to be executed by the hypervisor. But because I'm rewriting the operating system, I can get rid of those instructions; I can make it safe to run in ring one. And again, let me emphasize this: XenoLinux is not the same as Linux. It requires the hypervisor to be there. If I take XenoLinux and try to run it on bare metal, it won't work, because when it tries to do things like change its page tables, it's trying to communicate with a hypervisor that doesn't exist. I have fundamentally modified Linux to interact with the hypervisor. But doing that is what gets me the better performance.

All right, so on Wednesday, we will discuss James's paper, on what I'm calling virtualization in the browser. He may not like that description of it, but if he doesn't, he can complain somewhere. And then Friday, we will do exam review in class. I will see you then.