Nice, my mic's on. Oh, it's good. OK, so today we're going to continue talking about virtualization. And we're talking about kind of our first actual research paper that I asked you guys to read. Be honest, how many people read the paper? Or at least looked at the paper, clicked on the link. Aspirational goals, you know, yeah. So this is a good paper. And it introduces sort of a new point in the design space. This came out of the Cambridge Computer Laboratory, which has a pretty good track record of building actual things that people really use. There's a long list of co-authors on this page, and a bunch of different people worked on this project. So, OK, assignment three checkpoint: we're still shooting for a 5/5-at-5 deadline. So that's a week from Friday. Clearly, people are making it to the finish line. There are about 10 assignment 3.3 submissions so far. Most of them are perfect scores. So, you know, finish up. You guys have a little over a week and a half. OK, for those that read the paper, what kind of paper is this? Well, OK, but go back to our taxonomy. That's what the paper's about. What kind of paper is this? You know, we're reading this paper. This paper is from the early 2000s. So, yeah. Somebody either knows this exactly or read the slides. Because, yeah, this is both a big-idea paper, since it introduces a new approach to virtualization, and you can also think of it as a wrong-way paper. It claims that approaches that try to provide full virtualization are making the wrong design trade-offs when they try to support unmodified operating systems. So remember when we talked on Monday, that was our goal: just to be able to support unmodified operating systems. So the wrong way is this approach of full virtualization, of not modifying the operating system. So, the Xen trade-off. There are some quotes from the paper here. What they point out is that x86 was never designed to support full virtualization. And this is what we talked about last time, this problem that there are instructions in the x86 instruction set that are not classically virtualizable. It is not possible to trap and emulate those instructions because they don't trap properly. So for example, certain supervisor instructions, if I execute them with insufficient privilege, fail silently rather than causing a convenient trap. Efficiently virtualizing the x86 MMU is also difficult; we'll come back to this in a few minutes. These problems can be solved, but only at the cost of increased complexity and reduced performance. So what they're pointing out is, look at all this work that VMware and others had to do to fully virtualize the x86 architecture because of these irritations. So what are we going to do instead? Well, OK, so sorry, they also make another argument here as well, which is that, look, not only is it really, really hard to provide this illusion that the guest operating system is running on top of real hardware, it's also in certain cases not the best idea. It limits what we can do. And there are some opportunities here to provide better performance by providing virtualized hardware that's a little bit different than the underlying hardware. So: there are situations in which it is desirable for the hosted operating system to see real as well as virtual resources. So for example, timers is one thing that we'll talk about. Why do timers become an interesting concern? Yeah. Well, OK, so let's say I expose the operating system to the underlying hardware timers.
They're firing along at a pretty predictable rate. What is the operating system not used to in this environment, if it's running on virtualized hardware? So I'm running inside a virtual machine monitor. Yeah. OK, so that can happen, but why? No. No, the operating system is running along as usual. But what is the operating system not used to? Yeah. Yeah, the operating system's not used to being descheduled. So if you're running in a virtual machine monitor or hypervisor, that operating system can be stopped. This happens all the time. You go do stuff on EC2. Your little fake machine is sharing hardware with a bunch of other fake machines. And there are more virtualized servers on that box than there are CPU cores. Probably, unless you paid big bucks for one of these mega instances or whatever. But if you're running one of those cheap little micro instances, they're probably cramming hundreds of those onto a relatively beefy server. And so your operating system from time to time gets stopped, and then time goes by, and then it runs again. So what sort of problems can this cause? There's a couple of examples on the slide. Anyone here doing networking stuff? The examples they use are from network stacks. Does anyone know that TCP exists yet? Is that something that you've at least glommed onto over time? OK, good. Yeah, so there are these networking protocols. And one of the things they do is they try to figure out, when do I need to resend a packet? I wait for a while, and at some point, if I haven't gotten an ACK or some sort of indication from the other side of the connection that the packet was received, I assume the other host hasn't gotten that packet, and I have to retransmit it. But these timing estimates get wonky when the operating system isn't running all the time. Because, for example: my operating system ran. It sent some data. Then it was descheduled. Some other operating system ran on the same set of cores. And while I was waiting, the data came back, but it sat in a network queue somewhere in the virtual machine monitor, the hypervisor. And now when I'm scheduled again, that packet arrives. But from my perspective, it looks like it arrived a lot later than it actually did. Part of the delay was me, me waiting to be rescheduled. Does that make sense? And if I'm not aware of that when I'm doing these sorts of estimates, that can cause things to be wrong. So what I actually want is some information from the virtual machine monitor that says, hey, this is when this packet actually arrived in real time. And that requires that the operating system that's running as a guest now be aware. This is not full virtualization anymore. We are going to let the guest operating system know that it is running inside a virtual machine. And there are some features that come along with that. So one feature is the ability to distinguish between virtual time and physical time. Physical time marches on monotonically; virtual time stops advancing when I'm not using the CPU. And then there are some other things on the slide as well: it gives me access to low-level information about machine addresses, and some other things. OK. So the big idea here is this idea called paravirtualization. And the goal is: can I trade off small changes to the guest operating system in exchange for improved performance? And improved simplicity too.
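Just to make that real-versus-virtual time distinction concrete, here's a minimal sketch in C. The two clock accessors are hypothetical stand-ins for whatever interface the hypervisor actually exposes; Xen does publish timing information to guests, but the names and signatures here are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical accessors for the two clocks a paravirtualized guest
 * sees (assumed interface, not a real Xen API). */
extern uint64_t real_time_ns(void);     /* wall clock: keeps advancing even
                                           while this guest is descheduled */
extern uint64_t virtual_time_ns(void);  /* advances only while this guest is
                                           actually running on a CPU */

/* TCP-style retransmission timing wants *real* time: the ACK may have
 * arrived while we were descheduled, and that part of the delay is our
 * fault, not the network's. */
uint64_t rtt_sample_ns(uint64_t sent_at_real_ns)
{
    return real_time_ns() - sent_at_real_ns;
}

/* CPU accounting wants *virtual* time: a thread shouldn't be charged
 * for the interval when its whole guest OS was off the CPU. */
uint64_t cpu_time_used_ns(uint64_t started_at_virtual_ns)
{
    return virtual_time_ns() - started_at_virtual_ns;
}
```

The point is just that retransmission timers want the first clock and CPU accounting wants the second; a guest that only has one of them will get one of those two jobs wrong.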
So there are certain areas, particularly having to do with memory management, where the tangled web that I have to weave in order to get full virtualization to work can be substantially simplified by making some small changes to the guest operating system. So not only does my guest operating system change a little bit, but part of the goal here is that it performs better, and also that the hypervisor, my virtual machine monitor, becomes a little bit simpler. OK. So they say: we avoid the drawbacks of full virtualization by presenting a virtual machine abstraction that is similar but not identical to the underlying hardware, an approach which has been dubbed paravirtualization. This promises improved performance, although it does require modifications to the guest operating system. Now here's a critical component of this design: it does not require changes to the application binary interface. What does that mean? So I am going to change the operating system. Got that. We'll come back and talk about what those changes are, some of those changes. What does it mean to leave the application binary interface unchanged? It's kind of a strange term. Maybe you guys haven't heard this before. You've heard API; this is ABI. What does that mean? Yeah. OK, to what degree do I not have to change them? Yeah, exactly. So not only do I not have to change a single line of source code, I can run the same binary on a paravirtualized operating system. And it runs identically. So this is critical. What would be the problem if that wasn't the case? Let's say that they broke the ABI. Now what becomes hard? Yeah, I mean, I have to recompile everything, all my application binaries, from scratch. And I may not have source code for everything, proprietary things I bought. So yeah, it's a core design goal of Xen to leave the application binary interface unchanged. That's really important. OK, so here are the Xen design principles. When you're reading a paper, the design principles section, or the design section, is the part of a paper where the authors say, here's what we're trying to accomplish. And usually there's discussion surrounding why these things are important. So: unmodified application binaries. And the reason this is so important is because if we require changes to the applications, as opposed to the operating system, it's going to be very difficult to convince people to transition and use Xen. They're just going to keep using their old stuff, or keep using full virtualization. Again, the Xen folks are volunteering to make the changes required for paravirtualization to the operating systems they have to modify. They are not suggesting that everybody go out and rebuild all their software from scratch. That's way too much work. Supporting full multi-application operating systems is important. So I want to support real operating systems. The changes that I'm going to make to support paravirtualization aren't something that I can just do in little toy operating systems like OS/161. I have to be able to make these changes to Linux, Windows, BSD: real, modern, multi-application operating systems. So they claim that paravirtualization is required to get better performance, and we'll come back to this. And I love how they refer to this: uncooperative machine architectures like x86. So x86 is not that easy to virtualize. And rather than trying to force x86 to cooperate with me, I'm just going to say forget it. I'm going to make some small changes to my operating system.
And then finally, again, this comes back to what we've talked about before. Even if the underlying architecture were more cooperative, they're still arguing that I can actually make things work better by modifying guest operating systems slightly, in ways that allow them to detect that they're running on virtualized hardware, but also provide some new interaction opportunities for them to make things a little better. OK. So Xen is built around this idea of a hypervisor. Remember the architecture we had before, when we talked on Monday? We had an operating system that controlled the actual physical hardware, the operating system that you installed on the machine. So if you have a bunch of guests running inside VirtualBox on Windows, Windows is in control of the actual physical hardware. In Xen, there is no operating system in control of the physical hardware. There is a new piece of software called a hypervisor. The hypervisor is less fully featured than an operating system. The hypervisor is supposed to be as small as possible. As time has gone on, one of the criticisms of approaches like Xen has been that hypervisors have started to get more complicated. And so it starts to feel like I'm building a new operating system that is just designed so that I can run other operating systems inside of it. But at this time, this is the dawn of this era. So a hypervisor is this very, very small piece of software. The hypervisor has direct machine access. The operating systems do not. But because I'm going to modify them, I can play some tricks with the kind of access they do have. And now what happens is my guest operating systems run on top of the hypervisor. So you might wonder, how do I control the machine? How do I create new virtual machines and stuff like that? So this is clever. What Xen does is let you run the control plane software inside one of the guest operating systems (the paper calls this Domain0). This is something that, if you've configured VMware servers, you're used to. There's like one virtual machine on there that provides a web interface that allows you to make changes, and usually supports an interface to a client-side application. And you can log into that and poke around and do things. So this is the model. The configuration software is itself not part of the hypervisor. That makes the hypervisor simpler. It means that if I want to upgrade the control software, I just have to upgrade that one guest instance. So this is kind of a nice architecture. They could have built that into the hypervisor, but they chose not to. Yeah, good question. So remember, the hypervisor is running on bare metal, with direct access to the hardware. VMMs in general are not designed to run on bare metal; VMMs are designed to run as an application inside another operating system. The hypervisor is communicating directly with the hardware itself. And the virtual machine abstraction that's provided by the hypervisor in this case is a bit different. Does this make sense? Okay, so this is the architecture. I've got my hardware down here. I have the hypervisor. You can see it's providing a virtual CPU, virtual memory, virtual network, and IO devices. And this is actually one of the places where hypervisors provide some benefits: the hypervisor IO interface is actually a lot simpler, and it tends to be more performant for the operating system. In this configuration I've got three guests.
I have a... I feel like there's always this bug following me around. Like the last month, maybe the bug is coming out of some part of my body I can't see. But it's happening over and over. Go away. That's kind of scary actually, I don't know. Anyway, so there it is again. So I have three guests running here. I have a Linux guest, a BSD guest, and an XP guest, and these are hard to read. These guests, now this is important, are running Xeno-aware device drivers. So this is sort of a way of expressing the fact that these guests have been modified to work with Xen. Running on top of Xen requires some changes to the guests. And then there's user software running on top: unmodified user software. Okay, any questions? Okay, so here are some of the changes that Xen makes to the underlying machine interface to make virtualization easier. So for example, protection: the guest OS must run at a lower privilege level than Xen. So how is that possible? I thought I only had two privilege levels: user mode and kernel mode. How am I going to modify the guest operating system to run it at a lower privilege level? I mean, clearly Xen, the hypervisor, has to run at the highest privilege level. It needs direct access to the hardware. So it's going to have to run in kernel mode. How can I get the guests to run at a lower privilege level? I could run them in user mode, and then I'd have some of the problems that I had last time, but there's another option. Remember, anyone remember the end of last lecture? Yeah, what's that? x86 privilege rings. Yes, the x86 privilege rings. Weird, so x86 again has some features lying around that nobody used. Literally, when Xen started using x86 privilege rings, I think they were the first operating system to do that in decades. There were operating systems in the past that used ring 1 a little bit, but Xen was like, whoa, check this out. And for some reason, Intel kept that feature around; like, that doesn't make any sense to me. I would think that someone at Intel would be like, wait, hold on, why are we supporting these privilege rings again? But they probably said that about a lot of things, actually. That's the problem you have with winning, right? You end up having to support all these features for a thousand years. Okay, so system calls. Remember what happened on a system call before? I had this pretty gnarly path. A system call would trap to the host operating system, the host operating system would pass it to the VMM, and the VMM would have to hand it back somehow into the guest operating system, where it would be handled. With Xen, remember, I'm modifying the guest. So what do I do? When the guest operating system starts to run, it's allowed to register a fast-path system call handler with the hypervisor. And the hypervisor checks that system call handler to make sure that all of the addresses are valid, that I'm not trying to jump into somebody else's operating system. But then what it does is, for system calls, it can actually vector that system call right into the guest operating system. Does that make sense? With full virtualization, because the guest is unchanged, the traps are going to end up in the host operating system, so it goes host operating system, VMM, back to the guest.
On Xen, I modify the guest. So remember, the guest is an application now, like other applications. When it's scheduled and starts to run on the hypervisor, it communicates with the hypervisor to register this set of addresses. It says, when there's a system call, jump to this address in my operating system. And then those calls can be handled right back in the guest. So this is one example of an optimization that's possible on Xen that wasn't possible before. Time: each guest OS has a timer interface, and it can actually read both real time and virtual time. Real time being the actual time that marches along monotonically, virtual time being the time that's measured only while it's running. Yeah. Well remember, there's no host OS here; there's a hypervisor, right? So the hypervisor doesn't allow the guest operating system to handle all exceptions, only some of them. So for example, the software exceptions that are caused by system calls are an example where the guest is allowed to register this handler. But there are certain handlers that the hypervisor has to handle, right? For certain types of exceptions, it wouldn't be safe to let the guest handle everything, because then it could essentially take over the machine. But system calls are safe. So essentially, you can think about what happens. The guest starts to run, it communicates with the hypervisor. Does anyone who read the paper know what those are called? So normally, to communicate with the operating system, you make a system call; to communicate with the hypervisor, you make a hypercall. Yeah, a hypercall. So the guest operating system, which has been modified to work with Xen, makes a hypercall and says, when there's a system call, here's the address that I want you to jump to. And the hypervisor looks at that and says, okay, that address is inside the memory that I've allocated for you, so that's cool. And then it registers that handler with the interrupt service routines, so they know where to go. Does that make sense? Yeah. Okay, segmentation... well, okay, sorry, paging. Let's talk about paging. This is really interesting. And unfortunately, you guys can't fully appreciate how nice this is, because we never talked about shadow page tables, because, again, that would take like three lectures. So virtualizing the x86 MMU is really, really hard, partly because the MMU is allowed to walk the page tables that have been registered by the operating system to find translations. And there's this gnarly thing that you can do with full virtualization, and clearly it works; I mean, whenever you use VirtualBox, that's what's happening. But what Xen allows the guest to do is maintain its own page tables, with modifications to those page tables checked by Xen. We'll come back and talk about how this is done. Pretty cool. This is particularly important because lots of operations just read the page tables. Think about page table data structures: they don't change that often, but they get used all the time. Every time a process runs, those page tables get loaded, and then the MMU is using them constantly to figure out how to translate addresses. But the number of times that those page tables actually change is pretty small. Like when I call sbrk, or use mmap to set up a new memory mapping, or something like that. It doesn't happen very often. So page tables are sort of a read-mostly data structure.
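Backing up to that system call fast path for a second: here's a rough sketch in C of what the guest's registration might look like. The names are modeled loosely on Xen's x86 interface, which really does have a trap-table-registration hypercall, but treat the struct layout, constants, and signature here as assumptions for illustration, not the real API.

```c
#include <stdint.h>

/* One entry in the guest's exception/trap table (layout assumed). */
struct trap_info {
    uint8_t       vector;   /* hardware exception vector number */
    uint8_t       flags;    /* e.g., lowest privilege ring allowed to
                               raise this trap (encoding assumed) */
    uint16_t      cs;       /* guest kernel code segment selector */
    unsigned long address;  /* guest handler address to vector to */
};

/* Hypercall stub: asks the hypervisor to validate and install the
 * table. Xen checks that each handler address lies within the guest's
 * own memory before wiring it into the hardware exception table. */
extern long HYPERVISOR_set_trap_table(struct trap_info *table);

extern void my_syscall_entry(void);  /* guest kernel's syscall handler */
#define GUEST_KERNEL_CS 0x0000       /* placeholder: the guest kernel's
                                        actual code segment selector   */

void register_fast_syscall_path(void)
{
    struct trap_info table[] = {
        /* 0x80 is the classic int 0x80 system call vector; allowing it
           from ring 3 means user code can trap directly to the guest. */
        { 0x80, 3, GUEST_KERNEL_CS, (unsigned long)my_syscall_entry },
        { 0, 0, 0, 0 }               /* zero-terminated list (assumed) */
    };
    HYPERVISOR_set_trap_table(table);
    /* After this, a user-level int 0x80 vectors straight into the guest
       kernel without bouncing through the hypervisor first. */
}
```

The key property is in that last comment: once the table is validated and installed, the common-case system call never enters the hypervisor at all.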
Back to paging. What Xen allows me to do is give the MMU access to my own page tables, which I can read from and it can read from, but which I can't modify without Xen's help. So this is a really, really nice approach. It means it's fast in the common case, and still safe in the uncommon case where I'm changing things. All right, we'll come back to this. So we just talked about this. Xen makes the point that there are a couple of things the hardware could have that would make virtualizing memory easier. One is a software-managed TLB; another is a TLB with address space identifiers, so you don't have to flush it on every transition. And they point out that x86 doesn't have any of these features. Keep in mind, I don't want to throw too much shade at x86. I mean, this is an old architecture. I don't think they were thinking about this stuff when they designed it. What's that? I don't think the rings were designed to support virtualization. I don't know what the rings were for. I mean, that's a great question. I would love to know what the protection rings were supposed to be for. Now we know what they're for. But usually you don't add a feature and expect that people will discover what it's for 20 years later. But they did; I mean, it's kind of good design. Someone at Intel got a big promotion. It's like, yeah, I didn't know what that was for either. I'm glad VMware figured it out and told me all about it. I don't know, I might not be giving them enough credit. Maybe they knew what it was for, but who knows? Anyway, it didn't get used for a long time. Okay, so here's what Xen does about memory. Guest operating systems are responsible for managing their own page tables. This is good: it's the data structure I want to let the operating system manage. And they try to minimize the involvement of Xen in this process. The more I have to communicate with Xen, the more I have to make these hypercalls, the slower the operating system is going to get. So this is interesting. When we talked about full virtualization, the operating system slows down because traps get vectored into the host operating system and have to be handed back to the virtual machine. Or in certain cases, VMware has to rewrite parts of the operating system, which might also slow it down. In Xen, I communicate with the hypervisor by making these hypercalls. And so we're just introducing a new level here. Remember, when we talked about applications, the more they communicate with the operating system, the more they slow down. Same thing on Xen: the more the operating system has to communicate with Xen, the more it slows down. Making hypercalls is expensive. It would at least require a TLB flush if they had to run the hypervisor in its own address space. Now, one thing they do that's very clever is to prevent having to do that. So normally, every time I made a hypercall, I would have to flush the TLB and load the new mappings that Xen was using. But they get away with this kind of clever trick: the entire hypervisor is under 64 megabytes, and they load it into the address space of every guest operating system. They found this unused region at the very top of the address space, and they put Xen there in every address space that gets loaded. And so what this means is that when I make hypercalls, I don't have to flush the TLB, because I can allow TLB entries for the hypervisor to coexist with the operating system itself.
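Here's a rough sketch in C of what "Xen makes the change on my behalf" can look like as a hypercall, including the batching trick that comes up again in a moment. The names are modeled loosely on Xen's mmu_update interface, but the exact struct layout and signature here are assumptions for illustration.

```c
#include <stdint.h>

/* One queued page-table edit (layout assumed). */
struct mmu_update {
    uint64_t ptr;  /* machine address of the page-table entry to change */
    uint64_t val;  /* new value for that entry */
};

/* Hypercall stub: Xen validates each request (is this really one of
 * your page-table pages? is the new mapping one you're allowed to
 * create?) and then performs the writes itself, since the guest only
 * holds those pages read-only. */
extern long HYPERVISOR_mmu_update(struct mmu_update *reqs,
                                  unsigned int count,
                                  unsigned int *done);

/* Queue edits locally, then push them in one boundary crossing. */
#define MMU_BATCH 32
static struct mmu_update queue[MMU_BATCH];
static unsigned int queued;

void flush_pte_updates(void)
{
    unsigned int done;
    if (queued > 0) {
        HYPERVISOR_mmu_update(queue, queued, &done);  /* one hypercall */
        queued = 0;
    }
}

void queue_pte_update(uint64_t pte_machine_addr, uint64_t new_val)
{
    queue[queued].ptr = pte_machine_addr;
    queue[queued].val = new_val;
    if (++queued == MMU_BATCH)
        flush_pte_updates();
}
```

So a call like sbrk that touches a bunch of page-table entries can pay for one guest-to-hypervisor crossing instead of one per entry.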
Now, back to that shared mapping: obviously the operating system doesn't have write access to the area where Xen lives. But this does mean I can get in and out of the hypervisor without having to flush memory mappings, which is good. That helps with performance. So here's what Xen does with page tables; this is cool. When guest operating systems create page tables, here's what they do, and now that you guys are doing this yourselves, you have some idea of how this works. They create their page table data structure within their own address space, the way you've been thinking of it: any memory that's required is part of that address space. They have full access to it at the beginning, so they can make any changes they want. When they register the page table with Xen, they relinquish write access to it. At that point, Xen marks any pages of memory that the page tables are using as read-only to the guest operating system. Now again, I can still let the MMU use those pages to look up translations. So that's good. The question is, how do I change them later? So they have a new hypercall interface that allows the operating system to request that Xen make a change to its page tables. You can imagine: okay, there's a few bytes I need to change at this location. I tell Xen, please change this, and it does it on my behalf, because Xen has write access to this memory. To improve performance, if the guest wants to, it can batch updates together. So rather than doing one update at a time, which could be very expensive, it can say, here's a bunch of updates that I want you to make. I make one call, I pass you a data structure that has all the information about what to do. Xen runs, has write access to those pages, makes those changes, and then returns. Does this make sense? This is pretty clever. And again, it is way simpler than having to maintain shadow page tables. Okay, so let's talk about the CPU interface. I already talked about memory. These are modifications that Xen is making to the underlying machine hardware interface to better support virtualization. With memory, I'm playing this game with the protection of the pages that the page table data structures live on, which allows them to be used freely for translating addresses and gives guest operating systems a way to modify them when they need to. Now with the CPU: the insertion of a hypervisor below the operating system violates the usual assumption that the OS is the most privileged entity in the system. That's what we talked about on Monday. In order to protect the hypervisor from OS misbehavior, guest operating systems must be modified to run at a lower privilege level. If the operating system ran at the same privilege level as Xen, it could do everything Xen can do: it could modify its own page tables, it could break other operating systems. So clearly it has to run at a lower privilege level. And here we go, here privilege rings come to the rescue. Efficient virtualization is possible on x86 because it supports four distinct privilege levels. Ring 0 is the most privileged, full kernel mode; ring 3 is the least privileged, user mode. OS code typically executes in ring 0, and application code typically runs in ring 3. That's how normal operating systems work when they're running on bare metal. What Xen does, here we go, yeah: to our knowledge, rings 1 and 2 have not been used by any well-known OS since OS/2.
Does anyone remember OS/2? I didn't think so, okay. So anyway, 1980s, 90s maybe? I can't remember whose product that was. If I remember correctly, there were some nice ideas in OS/2. Clearly it didn't matter, because it's gone. So essentially, okay, sorry, I lost my place. What they do is modify the operating system to run in ring 1. There is enough that it can do on its own in ring 1, and when it needs help from the hypervisor, it does what we've been talking about: it essentially executes a system call, a hypercall, that causes it to start running in the hypervisor, in ring 0. So again, this is very, very analogous to the way that system calls are made by processes. When a process needs to do something that it doesn't have the privileges to do, it makes a system call. The system call starts executing operating system code at a higher privilege level, which then takes care of whatever the application wanted to do. So this is the same model they're going to use here. Just like with processes, I start to worry about the overhead of boundary crossings. So I start to think about the two things that happen often enough that they would really slow the operating system down if it had to communicate with the hypervisor every time. And those two things are page faults and system calls. System calls and page faults, okay. So system calls, we've just talked about this. We let the operating system register this fast exception handler that can be used by the processor without entering ring 0. When I register this, Xen checks it, and then as the guest operating system is running, those are the handlers that are used. This handler is validated before being installed in the hardware exception table, okay. So I think that's the last detail I had. I actually want to go back and talk about time for a minute. So let me ask a question. We talked about cases where knowing the real time is important. So knowing the real time that a packet arrived, for example, is useful and actually really critical when running congestion control algorithms like TCP's. Where's a case where I need to use the virtual time? Where's a case where it's really important that the OS know the virtual... yeah. Yeah, absolutely. So imagine, remember, when the operating system gets descheduled, it doesn't know. It's just like any other program. It just gets yanked off the machine, and something else starts to run, and that something else starts running all of its own applications. So if I didn't know how long I was stopped for, I would come back to life and be like, wow, that thread has been running for a long time. I never even gave it permission to run for that long. I don't know how this happened. So that's a case where I really need to be able to know the virtual time, because I shouldn't penalize threads for running when the entire operating system that they're part of was descheduled; clearly they weren't using the CPU. Okay, so, and this is what I'm interested in, right? If you think about full virtualization, I still have to make these modifications to the operating system, because it's impossible on x86 to run an unmodified operating system, because of these problems with the architecture. So VMware is like, okay, we're going to do all this fancy binary translation and make these changes at runtime. Xen's approach is, you know, why don't we actually be a little more intentional about the changes that we're making?
Instead of making them on the fly by rewriting the binary. Doesn't that sound scary? I mean, how many people would like to be assigned to a project at their new job that involved on-the-fly binary rewriting of an operating system kernel? Okay, I mean, you guys are nuts, right? Like, that sounds awesome. Also, you are going to crash a lot of machines. I'm just going to point that out right now, right? Talk about debugging, you know? Oh yeah, wow, I made some changes to the binary and it crashed, right? Good luck; GDB's not going to help you there. Any other questions about Xen? This is a cool paper, yeah. Yeah, sorry, I didn't talk about this before. So Xen allows you to configure virtual machines in the same way that we talked about on Monday. When you set up a new domain, as they call it, you allocate it a certain amount of memory, for example, right? And you might allocate it a certain number of cores. And then the hypervisor does the resource allocation between them. So on a multi-core machine, I can have multiple domains that run at the same time, right? I don't know what type of machine they ran this on. I'm sure it's in the paper, but maybe this is like a 16-core machine. So there could have been 16 operating systems running simultaneously. And each one of them could have been allocated, like, a couple gigabytes of memory. I'm sure the details are in there. That's not specialized, right? I mean, I can buy a machine with 32 CPUs off the shelf. There's nothing special about the machine. The machine is just a commodity server of the time, right? I mean, would you have been able to get this good performance running on a desktop-class machine? I don't know, right? Maybe. Part of their goal is, I think... well, I don't know, I won't put words in their mouth, but I think they're aiming more at a server-class market, right? Because this is the kind of thing that people who deploy servers would have loved to be able to do, right? Like, you know, on some level, when you start using virtualization in your servers, what you care about is: how many VMs can I pack onto one machine? Because as soon as I run out and start to perform badly, I've got to buy more hardware. All right, so on Friday, I'll be back and we will talk about... I'm gonna write a new lecture, I haven't done that in years, about container virtualization. So we'll talk about Docker and stuff like that. I'm sure that'll be fun. See you guys then.