A member of faculty at the University of Cambridge, but I'm actually on leave at the moment. The Xen project started five years ago with just myself and a few graduate students, and we worked on the project for a number of years before we really started having any impact. We did the first public release about two and a half years ago, then the 2.0 release back in November last year, or rather November 2004, and then in December 2005 we did the Xen 3 release. Over the last couple of years the project has really taken off and snowballed, and instead of just the original four of us working on it, there's now a really big development community, with lots of people in different companies and also people in the community working on Xen. So it's been a real fun ride getting to where we are today.

What I'm going to be talking about today is giving an overview of the different kinds of virtualization which are out there and where Xen fits into that space. Then we'll look at the architecture of Xen and some of the features which are new in the 3.0 version, and in particular do a bit of a deep dive on one feature which has actually been in there since Xen 2 but is one of the main reasons why people use Xen: the VM relocation feature, which enables you to move running virtual machines from one physical node to another. Then we'll have a look at where we're going with Xen and take questions at the end.

So there are a number of different kinds of thing called virtualization, and we'll just run through some of the possibilities. Quite a common one is to have a single operating system image, but then provide the illusion of actually having multiple operating systems, or multiple separated compartments, within that single OS image. There are various examples of this. Perhaps the best known is Solaris Zones, but on Linux there's also the VServer project and now the OpenVZ project providing that kind of functionality. Basically what these systems do is group user processes into resource containers, so if you're in one resource container and you type ps or top, you'll just see the processes in that container and you won't see what's going on in any of the others. It's kind of like chroot on steroids: rather than just hiving off a subtree of the file system, it's also hiving off bits of the process space and things like that.

But the trouble with this kind of approach is that it's quite hard to get strong isolation. If you're in one compartment, it's typically not too hard to break out and get into another compartment. Certainly if there's some sort of exploit on the kernel, you'll be able to escape and get into another compartment, or if you manage to trigger some sort of kernel bug, you'll probably take down the whole system and lose all of the compartments. It also means that all of the compartments have to be running exactly the same version of the kernel. The other problem is that it's hard to get good temporal isolation between the compartments: there are lots of shared resources within an operating system kernel, and it's very hard to provide quality of service guarantees to one particular compartment.

So an alternative approach is to run multiple operating systems, multiple kernels, on the same physical machine. And there are a number of ways of doing this.
The most common way is to use something like VMware, Virtual PC or QEMU, which all provide effectively an emulation of a standard PC platform, so you can just install an operating system within the virtual machine and run it. In most of these systems you effectively have a user process which is running the whole of the virtual machine. The trouble with doing this is that the x86 processor family was never designed with virtualization in mind. In fact, it's about the hardest architecture to virtualize that we've come across, and we've been porting Xen to a number of different platforms; it's just hard to do it efficiently on x86. So that technique works, but typically there's quite a performance cost associated with it.

A few years ago, when we were starting on Xen, we were trying to think of ways that would enable us to run multiple operating systems, multiple kernels, on the same platform with high performance. And it occurred to us that the best way of doing that would be to be prepared to make some modifications to the operating system kernel, to make it aware of the fact that it's running in a virtualized environment. It would then work cooperatively with the virtual machine monitor, where the virtual machine monitor, or hypervisor, would be enforcing protection, ensuring that the kernel can't do anything bad; but because they're working together, you should be able to get better performance. That's the approach we came up with for Xen, where we require the guest kernel to be ported to run over Xen. Once you've ported the kernel, all of the applications, all of the user space stuff, runs completely unmodified. If you think about that in terms of, say, a Linux distribution, it's really just a case of dropping down a new kernel package, a new kernel RPM, plus the Xen RPM, rebooting, and then your system is running over Xen. We try and keep the modifications as small as possible, just what we need to provide good performance, so the interface is actually very close to x86, just with a few key differences.

So what's virtualization good for? Why have people suddenly become interested in it? Well, one of the key observations is that if you look in a typical data center with thousands of machines, you find that the average utilization of those machines is typically very small, just 10% or something like that, because each application is typically running on its own operating system. Particularly in the Windows world that's very normal practice: you don't want to run more than one application on a single operating system. So you can save a lot of equipment cost just by consolidating multiple things which were previously running on different physical machines, each into their own virtual machine running on top of a single platform. The other thing you can do, if you've got this virtual machine relocation feature, is that if you get some warning from the hardware that it's about to fail, say because the fan has just stopped or the CPU is overheating, then you can migrate the virtual machine, with all of the applications running in it, to a different physical machine, keeping all of the TCP connections open, and the users of that virtual machine probably won't notice that anything is happening. So we can move that virtual machine off, take the machine down for service, replace the fan or whatever needs fixing, and then perhaps move the workload back again.
The other thing you can do is to dynamically make use of this VM relocation facility to balance workload between different physical machines. So if you've got a situation where you've got three virtual machines, but the machine is overloaded and perhaps you're not meeting the service level requirements for those applications - if you were looking at transactions on a database, you might find that the latency of the transactions is too high - you can decide to move one of those virtual machines off to a different platform. And you can do that, again, with very minimal downtime, perhaps just tens of milliseconds as you achieve the move, and then the virtual machine is running on the new platform and you've managed to balance the workload.

This is quite a commonly used feature of Xen, particularly by people who are using Xen for running virtual dedicated server environments: getting a rack of machines, chopping each machine up into perhaps 10 or 20 virtual machines, and then selling those to customers. Before, if they were using UML or some other virtualization technique, they'd find that they couldn't have as many virtual machines on each physical machine, just because the overhead of those techniques is higher. So immediately they can get more virtual machines per server. But then if they can make use of this dynamic load balancing capability, as customers start making use of their virtual machines they can balance the workload around. So if you had, say, one machine with 20 customers, and then if you had 10 machines, rather than just having 200 customers you might be able to have 400 or 500, because you can move the workload around.

The other big thing that virtualization enables you to do is to stand outside of the operating system and look inside to see what's going on, and you can use that to enforce all sorts of things. Enforcing security policy is a good example. You don't necessarily need to trust the systems administrator of each of the virtual machines to have correctly configured the firewall inside the virtual machine, because we can do it in the virtualization layer. So we can protect the virtual machine without having to trust the administrator to do that. It's kind of like having a firewall out on your network, except that by having it in the virtualization layer we can implement it on behalf of the virtual machines. We can also do things like virus scanning: we can look inside a virtual machine to check for viruses, which is perhaps better than relying on each of the administrators of the machines to have the antivirus software set up correctly. And similarly for things like backup: you don't need to trust the operator of a particular virtual machine to be doing that, because you can do it on their behalf.

Another nice thing you can do with Xen is to actually watch the virtual machine and see if it's doing anything bad. In particular, one of the things which has recently been done with Xen is this idea of having immutable memory, where a kernel can tell the hypervisor that it's giving up write access to a particular page, and it wants to give it up irrevocably: if it ever tries to write to it, it should fault, and the hypervisor can enforce this. So what a kernel will typically do when it boots is to give up write access to all of its kernel text memory.
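As a rough illustration of the immutable memory idea just described, here is a minimal, self-contained C sketch. It models the bookkeeping a hypervisor might do rather than Xen's actual interface; the page numbers and function names are invented for illustration:

    /* Illustrative model (not Xen's actual interface) of "immutable memory":
     * a guest irrevocably gives up write access to its kernel-text pages and
     * the hypervisor refuses any later attempt to write to them. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 16

    static bool writable[NPAGES];   /* current write permission per page */
    static bool sealed[NPAGES];     /* set once, can never be cleared    */

    /* Hypothetical call: give up write access to a page, irrevocably. */
    static void seal_page(int pfn)
    {
        writable[pfn] = false;
        sealed[pfn]   = true;
    }

    /* Hypervisor-side check when the guest tries to write to a page. */
    static bool permit_write(int pfn)
    {
        if (sealed[pfn]) {
            printf("fault: write to sealed page %d - stop VM, take crash dump\n", pfn);
            return false;
        }
        return writable[pfn];
    }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i++)
            writable[i] = true;

        /* Kernel boot: seal the pages holding kernel text (say pages 0-3). */
        for (int pfn = 0; pfn < 4; pfn++)
            seal_page(pfn);

        permit_write(7);   /* ordinary data page: allowed                  */
        permit_write(2);   /* rootkit trying to patch kernel text: trapped */
        return 0;
    }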
And that means that if someone does manage to take over the virtual machine and perhaps tries to install a rootkit, it will try to write to the kernel memory, that fault will be trapped, and then the virtual machine will be stopped and you can take a crash dump of it. So it's quite a useful security feature: spotting virtual machines that have gone bad and then saving them off so that you can poke around and look at them later.

So if we look at Xen 2.0, which came out in November 2004: we have secure isolation between virtual machines. We don't trust each of the virtual machines; we assume that they're malicious, that they've been taken over, and that we can't trust what they're telling us to do, so we need to enforce protection between them. We also have quite good control over the resources consumed by each virtual machine: we can bound the amount of memory it has, the amount of CPU time, and also the amount of network bandwidth and IO bandwidth it can consume. We do require the guest kernel to be ported to run over Xen, but everything else, all of the user level stuff, just runs unmodified. And there are plenty of operating systems that have been ported to Xen. If you're looking at Xen 2.0, there are ports of Linux 2.4, Linux 2.6, NetBSD, FreeBSD, Plan 9 and Solaris. Various things like NetWare have apparently been ported as well.

The advantage of taking this approach, where you have the kernel and Xen working hand in hand, is that you get execution performance which is really quite close to native. What we were trying to achieve with Xen is to get the performance so close to native that you would just have it on all of the time: you'd have virtualization available for whenever you wanted to take advantage of it. So the normal way of running a system would be to have Xen booted, and perhaps if you were only running one virtual machine on it you'd never actually take advantage of it, but it would be there if you wanted it. Certainly in a server environment that seems a good position to be in, because you never know when you might want to do one of these VM relocations or to consolidate work onto a server.

So looking at the para-virtualization approach we've taken with Xen: we make some changes to the architecture-dependent code of the operating system to make it aware of Xen. There are various levels of Xen awareness you can add. One of the things we've been doing is making it so that you can start off with a very small number of modifications and get your kernel ported and up and running, and then as you want more and more performance you add more code to take advantage of various optimizations. One of the key things is that the kernel has to call into Xen to do certain kinds of privileged operations. If it wants to change the page table base, or something like that, it can't do it directly - in x86 terms, it can't write to CR3 directly - it needs to call into Xen and ask it to do that on its behalf. By taking this approach, we avoid having to do any of the binary rewriting which other, non-paravirtualized virtualization techniques have to do. Binary rewriting is certainly something which is very tricky to do, very error-prone and hard to make robust, and we manage to avoid all of that. And we also get to minimize the number of privileged transitions we have to make into Xen, by designing the interface so that every time we call down into Xen, we get as much work done as possible.
So for example, when we're doing a context switch, we call down into Xen once and get a whole series of things changed. Obviously we'll change the page table base; we might have to change what's loaded into the debug registers; perhaps segment registers would have to change. There's a whole bunch of stuff which would normally require individual traps that we can do in one go.

There are other advantages to modifying the kernel and actually making it understand that it's running in a virtualized environment. A good example is having a distinction between wall clock time and virtual processor time. Wall clock time goes up as you'd expect, each second passing as you'd expect, but virtual processor time only goes up when this particular virtual machine has access to the CPU. You might have multiple virtual machines running on the same CPU with the time being multiplexed between them. The reason why it's desirable to have both kinds of time is that if you don't, you can certainly confuse the operating system's scheduling internally. You have to make it aware of the progress of real time, because otherwise, if anyone's logged in to one of these virtual machines and they type date, they'll see that time isn't going up as they expect, so you have to provide the notion of wall clock time to them. But imagine you've got two virtual machines running on the same physical processor and each of them has two processes to run. Whenever you schedule one of these virtual machines, you'll probably send two timer interrupts to it back to back to advance the wall clock time in the machine, and what typically happens is that the operating system will think it's just given a process a 50 millisecond time slice when actually it didn't have the processor during that time slice. So it can interact really badly with what the operating system is trying to do. It's much better to modify the operating system and just be straight with it: tell it the difference between wall clock time and virtual processor time, and then it can fix up its scheduling.

The other advantage is that if you expose real resource availability, the operating system is usually much better placed to figure out how to use those resources. If you tell an operating system it's got 512 megabytes of memory, but actually only give it 256 and then try doing paging in the background, typically the performance will be terrible. You'll have things happening such as the operating system deciding it wants to swap a page out to disk when the hypervisor has already swapped that page out, so the operating system causes the hypervisor to read the page back in just so that it can write it out to its own swap disk. You really want to avoid that kind of interaction. The best way is to be straight with the operating system: tell it it's got 256 megs and let it figure out how best to deal with that. Then you can dynamically add and remove memory and tell the operating system what you're doing. It knows far more about what's going on in the system and can do better optimizations.

So if we look at the architecture of Xen 3, we have this thin layer which is the Xen virtual machine monitor. Xen currently supports a number of different platforms: 32-bit and 64-bit x86. It's also been ported to IA-64 and Power, and recently it was announced that a SPARC port was going to be started soon as well.
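To make the batching idea concrete - one call into the hypervisor applying a whole series of privileged updates, as in the context switch example above - here is a small illustrative C sketch. It is not Xen's real hypercall interface; the operation names and values are made up:

    /* Illustrative sketch of batching privileged updates: instead of trapping
     * into the hypervisor once per operation, the guest queues a list of
     * operations and issues a single call. */
    #include <stdio.h>

    enum op_kind { SET_PT_BASE, SET_DEBUG_REGS, SET_SEGMENT_REGS };

    struct op {
        enum op_kind  kind;
        unsigned long arg;
    };

    static unsigned long traps;   /* count of guest-to-hypervisor transitions */

    /* One entry into the "hypervisor" applies every queued operation. */
    static void hypervisor_batched_call(const struct op *ops, int n)
    {
        traps++;
        for (int i = 0; i < n; i++) {
            switch (ops[i].kind) {
            case SET_PT_BASE:      printf("  load page-table base %#lx\n", ops[i].arg); break;
            case SET_DEBUG_REGS:   printf("  load debug registers %#lx\n", ops[i].arg); break;
            case SET_SEGMENT_REGS: printf("  load segment registers %#lx\n", ops[i].arg); break;
            }
        }
    }

    int main(void)
    {
        /* A context switch touches several pieces of privileged state, but
         * costs only one transition when batched. */
        struct op ctx_switch[] = {
            { SET_PT_BASE,      0x1000 },
            { SET_DEBUG_REGS,   0x0    },
            { SET_SEGMENT_REGS, 0x2b   },
        };
        hypervisor_batched_call(ctx_switch, 3);
        printf("privileged transitions: %lu (instead of 3)\n", traps);
        return 0;
    }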
So we have this thin hypervisor layer, where we try to keep it as small as possible and put in just the bare minimum we need to provide the protection and multiplexing between the different virtual machines running above. We really try to keep it as thin as possible so that we can do close inspection of it and hopefully avoid any security problems. Everything else we try to push above the line, into virtual machines.

When a Xen system boots, you need to load at least one virtual machine onto the hypervisor, and we typically call this virtual machine domain zero. This is typically the place where the native device drivers for the platform live: your real ethernet device drivers, your disk device drivers, the frame buffer, all of that sort of stuff will typically go into domain zero. It doesn't have to be that way, but that's the most usual configuration. Then up in user land you'll typically start a bunch of daemons which are basically the control software for Xen. That's what you use if you want to start another virtual machine: you issue some command - using the xm command or the web UI or something like that - within domain zero user space, and that talks down to the hypervisor, allocates resources for the new virtual machines, loads kernel images into them, and then unpauses them and allows them to boot.

Then we have other virtual machines over here, and these guys, at least in this instance, aren't seeing any of the real hardware. We could enable them to see real hardware devices - we could perhaps give one particular network card or one particular SCSI card to a given virtual machine - but in most cases you want to use virtualized devices. We call these front-end drivers: you'll typically have one or more virtual ethernet drivers and one or more virtual block device drivers, and they talk over these orange connections, which we call device channels - a kind of very high performance asynchronous shared memory transport - to corresponding back-end drivers. On the next slide we'll look at that in a bit more detail. But if you think about it for a network device, effectively you can think of this as a crossover cable: packets go into the virtual ethernet device here and pop out of the corresponding virtual device back end within this domain, and then we can control what happens to the packets, whether they get bridged onto a physical ethernet interface or whether something else happens to them.

Also in Xen 3, one of the key features we introduced over Xen 2 is the ability to run SMP guests. Xen has always run on SMP machines, but before, if you had a four processor machine, you'd want to run four uniprocessor guests to be able to make use of that machine, whereas now you can have one guest which uses all of those physical processors. You can even start up SMP guests which have more virtual CPUs than you have physical CPUs in the machine. This is a terrible thing to do from a performance point of view, but it is a great way of finding kernel bugs: we've found some interesting problems starting 32-way virtual CPU guests on two CPU boxes, and all sorts of missing locks get shown up in that configuration.

Now the other big thing that's happened is that the x86 CPU vendors, Intel and AMD, have been adding extensions to their platforms to make it easier to run unmodified operating systems - ones that haven't been ported to Xen.
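Going back to the device channels mentioned a moment ago, here is a simplified C model of the general idea: a shared-memory ring with producer and consumer indices, with a front end placing requests and a back end consuming them. The real Xen ring protocol is more involved (paired request and response rings, grant tables and event-channel notification), and the structure names here are invented:

    /* Simplified model of a device channel: a fixed-size ring in shared
     * memory with producer/consumer indices. */
    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE 8    /* power of two, so the indices can wrap safely */

    struct request { unsigned id; char payload[32]; };

    struct ring {
        struct request slot[RING_SIZE];
        unsigned prod;   /* advanced by the front end */
        unsigned cons;   /* advanced by the back end  */
    };

    /* Front end: place a request on the ring if there is space. */
    static int frontend_send(struct ring *r, unsigned id, const char *data)
    {
        if (r->prod - r->cons == RING_SIZE)
            return -1;                                 /* ring full */
        struct request *req = &r->slot[r->prod % RING_SIZE];
        req->id = id;
        snprintf(req->payload, sizeof(req->payload), "%s", data);
        r->prod++;                                     /* then notify the back end */
        return 0;
    }

    /* Back end: consume whatever has been produced. */
    static void backend_poll(struct ring *r)
    {
        while (r->cons != r->prod) {
            struct request *req = &r->slot[r->cons % RING_SIZE];
            printf("backend handling request %u: %s\n", req->id, req->payload);
            r->cons++;
        }
    }

    int main(void)
    {
        struct ring chan;
        memset(&chan, 0, sizeof(chan));
        frontend_send(&chan, 1, "transmit packet");
        frontend_send(&chan, 2, "read block 42");
        backend_poll(&chan);
        return 0;
    }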
You can think of these unmodified guests as legacy operating systems which have yet to become enlightened and be ported to run natively on Xen. We can support them by making use of the Intel VT-x and AMD SVM extensions. Effectively, what these do is enable the processor to trap when the guest trips over one of the instructions which can't be properly virtualized; that calls back down into Xen, and Xen can figure out what to do. So if you've got one of these new CPUs, you can now run Windows and other operating systems completely unmodified. There is a performance overhead associated with doing that - it's much better to port them to run natively on the platform.

Looking at IO again, we have these IO spaces where we can assign a given virtual machine access to specific devices. We can specify a PCI bus path, and then that virtual machine, when it does its PCI bus scan, will see that device and can use it natively. We have a way of virtualizing the PCI configuration space, and also of virtualizing interrupts to ensure that Xen remains in control of the interrupt hardware; we just provide virtual interrupts to guests. And then we have the device channel mechanism to enable us to export access to the physical hardware to other virtual machines.

In the network case, a typical configuration would use the software bridging infrastructure which most kernels have - certainly Linux does. You have these virtual devices within domain zero, plus your physical devices, and you can put them all on a software bridge implementation, and then effectively it's as if all of those virtual machines were actually on the physical network. Or rather than just using bridging, you could use routing, or you could put iptables firewall rules in and really control what kind of packets can go between those virtual machines. We can actually insert rules to ensure that a virtual machine can only use a particular source IP address; all of those kinds of things can be enforced using the normal tools. And if you look at block devices, you can export any block device which the OS knows about as a virtual disk to another virtual machine. So you could export a physical partition - sda4 or whatever, in Linux speak - it could be a loop device connected to a file on some other file system, or you could use LVM and export a logical volume; that's the preferred way of doing things. Just to note at the bottom, one of the things happening over the next year or so is that manufacturers are producing smart IO devices which we can enable guest virtual machines to talk to directly in a safe fashion, and that way we can get much higher IO performance than having to do this passing through domain zero. So that's one of the big improvements coming with new hardware.

So just to run through a few performance figures: there are four different benchmarks, represented by the four clusters. The bar on the very left, the darkest bar in each cluster, is the native performance of Linux running on the bare metal, and the bar next to it is the performance of Linux running over Xen on the same machine. The first cluster is running the SPEC CPU2000 integer suite, and basically that's hard to screw up: it spends a lot of time in user space.
Pretty much any virtualization technique you use will do pretty well on that, and we see no performance overhead at all running that kind of workload. There are many enterprise workloads which look quite a lot like that - things like the SPECjbb benchmark are all very CPU intensive and spend most of their time in user space. Some of the other benchmarks start doing more IO; they start calling into the kernel more, doing more process forks and exits and things like that, and they put more stress on the virtualization software. The next cluster is the time to do a Linux build, and we do see some overhead here of a few percent; other virtualization techniques take a rather bigger hit. Then if you look at Postgres being exercised by the Open Source Database Benchmark's online transaction processing suite, we're seeing something like a 6% performance hit there, but it's all quite acceptable - most people will be happy to run with that performance hit given the benefits available to them. And then the final cluster is running the SPECweb99 benchmark against the Apache 1.3 web server, and again we see really quite minimal overhead.

One question that people often have is: if you're running multiple copies of the operating system, isn't there some significant overhead to doing that? Operating systems are quite big, heavyweight things. Well, the reality is that operating system kernels are actually pretty small compared to the applications that you run on them, so the actual overhead really isn't very big. What this chart is trying to show - up to a 16-way system here, though it will actually scale out rather further - is a comparison between running one copy of Linux with two copies of the application, on the extreme left, and running Xen with two copies of Linux, each running one copy of the application, and then two, four, eight and sixteen-way. We see there's very little difference between the two: the overhead of having an extra copy of the Linux kernel and glibc and things like that is small compared to the size of Apache, the dataset, the buffer caches and so on.

So if we look a little bit at how Xen works - I'll go through this pretty quickly - there are three things which we're trying to protect from each other. We have user space and the kernel; obviously we need to protect the kernel from user space. And now we've introduced Xen into the system, and we need to protect Xen from both the kernel and the user space. So there are three things we want to protect, whereas normally we just have two. It's not very easy to do that with page-level protection on x86, but we're able to use some of the segmentation features which are still available on the x86 architecture - they perhaps haven't been used much in a few years, but they still work just fine. What we do is create different segments for the different parts of the address space. For example, in a typical Linux configuration you'd have a ring three segment - the least privileged protection level - covering the three gigabytes of user address space; then ring one, which is where the kernel runs, covers the kernel area and the user area; and then ring zero, the most privileged thing in the system, has access to the whole address space. That works well and gives good performance.
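Here is a rough sketch in C of that 32-bit protection scheme: three privilege levels seeing successively larger parts of the address space. The split addresses used are representative values for illustration, not necessarily the exact layout Xen uses:

    /* Illustrative model of segment-limit based protection on 32-bit x86. */
    #include <stdio.h>

    #define USER_TOP  0xC0000000UL   /* assumed 3GB user/kernel split          */
    #define XEN_BASE  0xFC000000UL   /* assumed: hypervisor keeps the top area */

    static int access_ok(unsigned long addr, int ring)
    {
        switch (ring) {
        case 3: return addr < USER_TOP;   /* user: only below 3GB              */
        case 1: return addr < XEN_BASE;   /* guest kernel: everything but Xen  */
        case 0: return 1;                 /* Xen: the whole address space      */
        default: return 0;
        }
    }

    int main(void)
    {
        printf("user touching kernel memory:  %s\n", access_ok(0xC1000000UL, 3) ? "ok" : "fault");
        printf("kernel touching Xen memory:   %s\n", access_ok(0xFD000000UL, 1) ? "ok" : "fault");
        printf("kernel touching its own code: %s\n", access_ok(0xC1000000UL, 1) ? "ok" : "fault");
        return 0;
    }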
Unfortunately, on 64-bit, AMD actually made the x86-64 architecture even harder to virtualize than 32-bit x86, by doing away with some of the protection that those segment limits provided. What we do in this scenario is take advantage of the much bigger virtual address space, which is very convenient for Xen: in the 32-bit implementation we have to try and shoehorn ourselves into a relatively small area of virtual address space, whereas on x86-64 we just help ourselves to eight terabytes, because no one's going to miss it. We end up with the kernel growing down, and Xen sitting in the sign-extended address space at the bottom of the top half of the address space, helping ourselves to a mere eight terabytes. To provide protection between these three things we have to use page protection tricks. The way we do that is to run both the user space and the guest kernel in ring three, but with different page tables, and use tricks to switch between them. We've come up with a pretty efficient way of doing that; particularly on AMD Opteron systems, which have something called a TLB flush filter, there's actually very little overhead in doing it.

Now, one of the hardest things to para-virtualize is the MMU. Most systems choose to use a technique called shadow page tables, where the page tables that the guest is manipulating are not the page tables that the hardware is actually using, so there are two sets of page tables that you're trying to keep synchronized. That's quite a pain to do, and there's quite a memory cost and performance cost associated with it. With Xen para-virtualized guests we instead use what we call direct mode MMU virtualization: the page tables that the hardware is using are the page tables that the guest is manipulating, but obviously we then need to control how the guest can manipulate those page tables. One consequence of this is that the guest must call down into Xen whenever it wants to change the page table base, and Xen has to validate that page table; but we have a scheme that enables us to do that in an incremental fashion, so for something like a context switch we know that the page table has already been validated, and we just have to track updates that occur.

So there are various rules that Xen has to enforce. One is that a guest may only map pages that it owns: it can only install a PTE which points to a page that it owns. There's a caveat there, because there is a mechanism in Xen for allowing consenting guests to share pages for communication purposes, but that's a special case of the mechanism. The other rule concerns pages containing page tables: a guest can only create a read-only mapping to a page containing a page table, because otherwise, if it had a writable mapping, it could go behind Xen's back and then break rule one. So we just ensure that those two axioms are always enforced, and we can do that in an efficient fashion. I'm going to skip a few slides here which go into more detail of how we actually do that, but the bottom line is that a guest can read its own page tables, but it can't write to them directly. And if you're doing something like a process fork or exit and want to do bulk updates to page tables, there's a mechanism we call unhooking which gives a guest temporary write access to a page table page in a safe fashion, knowing that it can't possibly get into the TLB, and that allows us to do these bulk updates without calling into the hypervisor each time.
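Those two rules are simple enough to sketch. The following self-contained C model shows the kind of check a hypervisor would apply to a proposed page table entry update; the data structures and names are invented for illustration and are not Xen's actual code:

    /* Illustrative model of the two page-table rules: (1) a guest may only
     * point PTEs at pages it owns, and (2) any mapping of a page that is
     * itself a page table must be read-only. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 32

    struct page_info {
        int  owner;          /* which domain owns this machine page   */
        bool is_pagetable;   /* currently in use as a page-table page */
    };

    static struct page_info page[NPAGES];

    /* Validate one proposed PTE update from domain 'dom'. */
    static bool validate_pte(int dom, int target_pfn, bool writable)
    {
        if (page[target_pfn].owner != dom)
            return false;                    /* rule 1: not your page       */
        if (writable && page[target_pfn].is_pagetable)
            return false;                    /* rule 2: page tables are RO  */
        return true;
    }

    int main(void)
    {
        page[5] = (struct page_info){ .owner = 1, .is_pagetable = false };
        page[6] = (struct page_info){ .owner = 1, .is_pagetable = true  };
        page[7] = (struct page_info){ .owner = 2, .is_pagetable = false };

        printf("dom1 maps its data page writable:   %s\n", validate_pte(1, 5, true)  ? "ok" : "refused");
        printf("dom1 maps its page table writable:  %s\n", validate_pte(1, 6, true)  ? "ok" : "refused");
        printf("dom1 maps its page table read-only: %s\n", validate_pte(1, 6, false) ? "ok" : "refused");
        printf("dom1 maps a page belonging to dom2: %s\n", validate_pte(1, 7, false) ? "ok" : "refused");
        return 0;
    }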
The net result - again a similar kind of graph to the last one, with native Linux and Xen as the two bars next to each other - is shown for page fault time and process fork time. We certainly take a hit for doing MMU virtualization, but it's a much smaller hit than any of the other techniques using shadow page tables or anything else, and this is the kind of benchmark which is really pounding the hypervisor hard; we still do okay on it.

To be able to support SMP guest kernels, we have multiple virtual CPUs which we can then schedule on physical CPUs, and the virtual CPUs communicate with each other by sending virtual IPIs via Xen. We enable virtual CPUs to be hot-plugged and unplugged, so you can dynamically add CPUs to a guest just like you can dynamically add memory: you can boot a guest with four CPUs and 256 meg of RAM and then add more memory and more virtual CPUs as the load increases, and also take them away again if they're not being used. One thing to point out, though, is that many applications we've looked at actually have very poor SMP scalability, even on small systems like two-way and four-way; we see this even running natively without Xen. So one common usage of Xen, even in the Xen 2 days before we had multiple virtual CPUs, was to get, say, a two-way or four-way box and chop it up so that you could run a separate kernel on each physical CPU and run multiple copies of your application. Often that gives rather better performance than letting one kernel see the whole of the machine, so it's always worth figuring out whether that's the best way of running your application. Even for things which you'd expect to scale quite well, like Apache and Postgres, it turns out you're often much better off running multiple copies if you can.

So the para-virtualized approach has a number of benefits that really enable us to get pretty good SMP performance. It's very hard to do that with a fully virtualized approach, because the kernel can give the hypervisor a lot of help, and avoid a lot of calls and traps down into the hypervisor, by making a few changes to the kernel code. Now, with the advent of VT and Pacifica - Pacifica has since been renamed SVM, but it's the name of the AMD virtualization extensions - we can run unmodified operating systems. So let's go straight to the diagram and look at how that works. We have Xen, we have domain zero, and then one or more other para-virtualized guests - guests that have been ported to run natively on Xen. Using the new CPU extensions, all of these run within what's called the root context, which is the most privileged thing in the system. But then you can provide these nested contexts where you can run other virtual machines, which will trap back out to the root context. So in this example we can have, say, a 64-bit hypervisor and then run a 64-bit unmodified operating system. It will just start off in 16-bit mode and run the BIOS - we've dropped in a virtual BIOS, which actually comes from Bochs. And we have IO emulation, which actually comes from QEMU, where we're emulating a PCnet32 network device, an IDE DMA controller and a Cirrus Logic graphics card. By doing this emulation, the machine will boot, come up into 32-bit mode, and run within this nested context.
Whenever it trips over one of these privileged instructions, we exit back to the hypervisor, where we can patch things up, figure out what to do, and then return. So the hypervisor is having to do quite a lot of work, and we're not getting the batching of calls into the hypervisor, which is why we take a performance hit. But as the hardware gets better in future revisions, that cost should start coming down, and it's still quite usable: you can run Windows quite happily and the performance is quite usable. The other thing you can of course do on a 64-bit hypervisor is run 32-bit guests as well. One of the key things you could potentially do is to start putting the para-virtualized device drivers into these guests to get better IO performance. But actually what we're doing is quite a bit of work to improve the IO emulation, because we lose quite a bit of performance doing IO at the moment; we believe we can fix that up and get really quite good IO performance. One of the pains we do have to deal with is that we have to use shadow page tables, because we can't allow the guest to control the page tables directly. So the hardware will be pointing at a set of page tables managed by the hypervisor, while the guest will be writing to its own memory thinking it's manipulating the page tables directly, completely oblivious to the fact that it's not actually talking to the real hardware page tables, and we have to do a lot of work to keep the two in sync.

Let's move straight on to looking at virtual machine relocation: using this technique to move a running virtual machine image between two physical nodes. It's good both for availability and for doing load balancing, and we can move it with very low downtime. There are a couple of assumptions about the environment that we're using VM relocation in. We assume that you've got network storage - that you're talking to some external device and can access your storage even when you've moved to a different physical node. So we're assuming you've either got something like an NFS server, or you're using a SAN, or iSCSI, or, if you are using local disk, you could be using something like GNBD or some other network block device protocol, or DRBD. We also assume you have good connectivity: we're assuming these are machines within, perhaps, a single server room, with gigabit ethernet between them or better.

So there are various challenges in doing VM relocation. One is that virtual machines have lots of state. If you think about a one gigabyte virtual machine, copying that over a gigabit ethernet network would take a minimum of eight seconds, and users are going to notice an eight second outage while that copy is happening. Some virtual machines have really quite strict service level requirements: web servers, databases, and in particular things like game servers need to be servicing requests pretty much continuously, and outages are no good. A key point here is that if you're running some sort of cluster file system, an eight second outage will cause you to get kicked out of the quorum, and then when you wake up at the new location it causes complete chaos. So we need to minimize the downtime.
The other thing we need to do is to bound the amount of resources we use to do the relocation, because if you're using 100% of the machine to perform the relocation then your application isn't getting to run, and you might as well have just stopped it. So we need to get the balance right between the two.

Looking at how VM relocation works: we have a pre-migration stage where you look at the other host, figure out whether it's going to be a suitable candidate for running this virtual machine, and reserve resources on the destination host. Then all of the real work happens in stage two, where we're synchronizing the memory between the two hosts. The virtual machine continues to run, but when it writes to pages we need to transfer those updated pages across to the destination. After some number of iterations we decide that the memory is close enough to being synchronized, so we stop the virtual machine at the source, copy the remaining pages across along with the register contents and things like that, check that everything has arrived safely at the other end, and then start the virtual machine in the new location.

Looking at this pictorially: we have a source host and a destination host. The grey represents all of the memory on the source host, where the virtual machine is still running, and the empty box is the container we've created to hold it on the destination. As the virtual machine runs, we're copying the memory across. The orange boxes show where the application, as it runs, is touching memory and dirtying it, so we realize that this memory is out of sync and we're going to have to resend it. After one iteration of this algorithm we've done pretty well - we've filled in a lot of the grey over here - but there are a number of holes, and we have to go back to the top and start resending those pages. The good news is that there's less to send this time, so it's quicker, and hopefully the guest will dirty fewer pages this time around - it's still running and still dirtying pages, but hopefully fewer. We do some number of iterations of that, watching how we're doing and making adjustments. But let's say in this simple example we get to the bottom of the second iteration and say, okay, there are only four pages left to synchronize, so let's stop the virtual machine. What we do then is copy those final four pages across, along with the register state of all of the virtual CPUs, check that everything has arrived safely, start the virtual machine at the destination, and then kill the one at the source and throw it away. Now the virtual machine is running on the new node.

Looking at an example of this: we have a web server serving data to 100 concurrent clients, pushing out data at 870 megabits a second - that's this line along here. At this point we decide we're going to relocate this virtual machine to a different physical server - it happens to be an identical one - and we're going to commit up to 10% of the machine to doing that. So we start the copy operation and we see the performance drop by 10%, because we're using 10% of the CPU and 10% of the network bandwidth to synchronize the memory between the hosts. This synchronization process takes quite a long time, because effectively there's only 100 megabits available to us, and something like 60 seconds later we get to the second iteration and then start going into further iterations.
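The iterative pre-copy just described can be sketched in a few lines of C. This is a toy model: the dirtying rate and thresholds are invented numbers, purely to show the shape of the algorithm:

    /* Toy model of iterative pre-copy: copy all memory while the guest keeps
     * running, then keep re-copying whatever it dirtied, and only pause the
     * guest for the final small remainder. */
    #include <stdio.h>

    #define VM_PAGES        1000
    #define STOP_THRESHOLD  8      /* pause the guest when this few pages remain */
    #define MAX_ITERATIONS  10

    /* Pretend the running guest dirties a fraction of whatever we just copied. */
    static int pages_dirtied_while_copying(int pages_copied)
    {
        return pages_copied / 5;
    }

    int main(void)
    {
        int to_send = VM_PAGES;

        for (int iter = 1; iter <= MAX_ITERATIONS; iter++) {
            if (to_send <= STOP_THRESHOLD)
                break;                                 /* close enough          */
            int dirtied = pages_dirtied_while_copying(to_send);
            printf("iteration %d: sent %d pages, %d dirtied meanwhile\n",
                   iter, to_send, dirtied);
            to_send = dirtied;                         /* resend only these     */
        }

        /* Stop-and-copy phase: the guest is paused only for the last few pages
         * plus register state, so the observed downtime is very short. */
        printf("pausing guest, sending final %d pages and CPU state\n", to_send);
        printf("resuming guest on destination host\n");
        return 0;
    }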
In this example, we realize that 10% isn't quite enough to do the relocation - we're not quite keeping up - so we increase the resources to something like 15%, which is good enough; I think we get up to something like 16 or 17%. Then we stop the virtual machine, copy the final remaining pages and the register contents across, and start it on the destination host, and now we're back to 100% full performance. If you look at the downtime as observed by any of the clients of this web server, 165 milliseconds is the outage they observe, which is probably not going to be noticeable to the users of that virtual machine.

Another example: we had a machine where we worked out what the 100% load capacity was running the SPECweb benchmark - the number of simultaneous clients you could run the SPECweb load against. We then backed off to 90% of the total capacity of the machine, started running that workload, and then did migrations backwards and forwards between two machines. The good news was that we still passed the benchmark while bouncing the virtual machine backwards and forwards, and the downtime observed by any of the clients was 200 milliseconds - again, well within spec.

And now to a really important, mission critical workload: a Quake 3 server, bounced between a number of machines while various grad students were playing Quake. We observed that the downtime, looking at the inter-packet latency during these migrations, is about 50 milliseconds, and certainly no one realized this was happening - whenever we bounced it between two servers, no one could really tell that it was occurring.

So if we look at the current status of Xen 3, things look pretty good - there's a lot of green on this slide if you're looking at x86, both 32 and 64 bit. If you're looking at IA-64 and Power, those ports are less well developed: they don't have SMP guests, they don't have save, restore and migrate, and things like that yet, but there's a lot of work going on to try and get them up to the same level of capability. If we look at what's coming next, we're putting an awful lot of effort into making fully virtualized, unmodified guests work as well as we possibly can - we want to be able to run Windows really well. There's also work going on on the various control tools for Xen, to provide nicer user interfaces and make it easier for people to play with Xen and install guests on it. There's also a lot of work we need to do on automatic performance tuning and optimization: I'm afraid that today, to get the best possible performance out of a Xen system, you need to be quite a guru and know what's going on, know how to assign virtual CPUs to physical CPUs and stuff like that. We really want to do better in the next release and make more of that automatic. Also running on bigger machines: I think the biggest machine Xen boots on today is a 64 processor box, but there's a lot of work we can do to make the performance better by making Xen more NUMA-aware. There's also some cool work going on providing virtual frame buffer support; in particular, there's some really nice work which does OpenGL compositing, so you can actually have guests doing OpenGL rendering and then composite them onto a single frame buffer. That's the best way of getting good 3D graphics performance. And as we talked about before, there's going to be hardware coming along that enables guests to talk to the hardware directly in a safe manner.
These are the so-called smart NICs, and we're going to be able to take advantage of them to get improved networking performance for guests.

Now, one of the nice things about Xen is that, as well as all the work going on in the community and within various companies to drive Xen development, Xen has become a major platform for doing operating systems research in universities around the world, including Cambridge. There's lots of fun stuff being worked on which perhaps isn't going to make it into 3.1, but might make it into Xen 4 and future versions. One of the nice things is that, while Xen is already pretty good for kernel debugging - you can use GDB and attach it directly to a kernel in a guest, you don't need KGDB or anything like that - there's been some really nice work demoed by a guy in Cambridge on whole system debugging, where you can set up your whole three-tier web application on a Xen box, have the whole thing running, hit a breakpoint, and then make use of Xen's deterministic replay facility to step all of the virtual machines backwards in lockstep and see how you got into this bad situation. So it's awesome for debugging those kinds of applications. There's also work on software-implemented hardware fault tolerance, where you run two virtual machines in lockstep on different physical CPUs, and if one fails the other can immediately take over. Also using Xen to build multi-level secure systems, so you can really compartmentalize information and ensure that it can't flow between different virtual machines. And there's some work done by some guys in San Diego who wanted to run extreme numbers of virtual machines on the same Xen box, basically running a whole honey farm on a single box. They're running many thousands of virtual machines on the same physical box, and they do this using a VM forking technique where each virtual machine is a copy-on-write version of its parent.

So there's a lot going on. I think you'll find that Xen is a pretty complete and robust hypervisor. Lots of people are using it in deployment situations today, in production environments. It has really quite good performance and scalability, and gives you quite good control over how resources are allocated between virtual machines. There are a lot of people working on Xen: if you're a developer with any sort of kernel hacking skills, or even want to work on some of the user tools that make Xen actually fly, then please get involved - it'd be great to have more people contributing. And there are a lot of vendors helping out on Xen: all of the chip vendors and all of the various hardware vendors are making sure that Xen works well on their platforms. If you want to play with Xen, go to the xensource.com community website. You can download a demo CD - that's a live ISO with a number of different Linux versions on it; there are older ISOs that also have NetBSD, and I think there's even a Solaris one around as well now. Or Xen is now a standard part of Fedora Core 4 and also Fedora Core 5, though that tends to be a version which isn't as up to date; the one in the pre-release version of SUSE is very new. So that's it - thanks. There's a couple of minutes to take questions.

So, getting the Xen patches for Linux back into the kernel tree is certainly something we want to do; we've had someone working on this.
In fact, I think Linux is about the only operating system Xen has been ported to where the patches aren't already in the mainstream kernel - it's certainly happened with most of the others. There's work going on to do that; there are patches floating around on the xen-merge mailing list, and if anyone wants to contribute and help with that, that would be great. We're hoping to start submitting things in the next few weeks, really.

How is Xen tested? Well, one of the good things about the university spinning off a company to try and drive Xen development is that we've been able to invest in decent QA. We actually have quite a large QA lab, with machines in there running tests 24 hours a day. It's quite hard testing something like Xen, because you need to be able to start multiple virtual machines, run different workloads within those virtual machines, and have communication going on with the outside world over the network interfaces and the disks. So we've put a lot of effort into getting an environment that enables us to do automated testing, and these machines just run 24 hours a day, picking different virtual machines, different kernels and different applications, running them, and perhaps messing around with the control tools at the same time - adding and removing memory, adding and removing virtual CPUs. Using these tools, we can make sure we get good quality before a release.

What happens if domain zero crashes? That would be bad - you'd lose the whole machine. So in an enterprise configuration of Xen, you generally wouldn't allow users to log into domain zero; you would just have your control tools in there. One of the other features I didn't talk about today is that you can actually run the device drivers in their own domains, and then if a driver domain fails, you can just throw it away and restart it. So you can recover from things which would normally crash the entire system, like a device driver failure: typically within a few hundred milliseconds you'll have recovered the system and reconnected all of the guests' virtual interfaces, and they won't notice that anything has happened other than a brief network or disk outage. So you can use Xen to make systems more highly available than they were before.

Did you ask, is it possible to clone? Cloning is very similar to doing a migration or a save and restore; the only difference is that you don't shoot the original in the head. And yes, you certainly can do that, but you have to be very careful. The reason we don't allow it with the standard control tools is that people do it and then forget that both copies are connected to the same virtual disk image, and if you're running both of them, it's game over - having two guests mount the same file system is a guaranteed way of killing it. So if you're going to do that, you need to coordinate it with taking a snapshot of the file system and having an independent copy of it.

You can certainly do things like running a web browser in its own virtual machine; that would be quite a reasonable thing to do. There's also work which has been done by IBM, who have come up with a framework that enables you to do very fine grained access control on all of the various hypercalls into the hypervisor, so you can enforce security policies like Chinese wall and stuff like that between the various guests.
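As a flavour of what such a policy check might look like, here is a small, self-contained C sketch of a Chinese-wall style admission test: virtual machines whose labels fall in the same conflict set are never started on the same host. The labels and structure are invented for illustration and are not taken from the IBM framework:

    /* Illustrative Chinese-wall admission check for starting VMs on a host. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_RUNNING 8

    /* Two competing workloads that must never share a physical machine. */
    static const char *conflict_set[] = { "bank-a", "bank-b" };
    static const int   conflict_len   = 2;

    static const char *running[MAX_RUNNING];
    static int         nrunning;

    static bool in_conflict_set(const char *label)
    {
        for (int i = 0; i < conflict_len; i++)
            if (strcmp(label, conflict_set[i]) == 0)
                return true;
        return false;
    }

    /* Admission check before starting a new VM with this label on this host. */
    static bool may_start(const char *label)
    {
        if (!in_conflict_set(label))
            return true;
        for (int i = 0; i < nrunning; i++)
            if (in_conflict_set(running[i]) && strcmp(running[i], label) != 0)
                return false;              /* would breach the wall */
        return true;
    }

    int main(void)
    {
        running[nrunning++] = "bank-a";
        printf("start web-farm VM: %s\n", may_start("web-farm") ? "allowed" : "denied");
        printf("start bank-b VM:   %s\n", may_start("bank-b")   ? "allowed" : "denied");
        return 0;
    }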
Have a look at the Xen security mailing list - I think it's the XenSE mailing list - where there's a lot of talk about how best to do that kind of thing. Any other questions?

The question was, can you prevent software running within a virtual machine from telling that it's running in a virtualized environment? That's something we don't really try very hard to do, because in principle it's impossible: you can always observe various timing effects. If you issue various operations - system calls within the user application - and time them, you'll be able to notice timing differences between running on bare metal and running on Xen. And because you can always tell, our general policy has been not to try and hide it. You could attempt to hide it, but in principle an application that wanted to know could always figure it out, and I think that's always going to be the case. We can do things like lie to it about the hardware it's running on and so forth, but if it really wanted to figure it out, it probably could.

The question was, how many people have worked on Xen? Well, there was the core group which worked for a very long time before the whole thing took off - initially myself and Keir, and then a group of four people that worked on it for a number of years - and there's still an awful lot of code from that time in there. If you look at the change log, I think there were something like 200 different email addresses in it as of the last Xen Developer Summit back in January, and the turnout at that summit was something like 100 people who are actively doing stuff.

Well, it's not just applications, it's operating systems themselves. There are all sorts of problems with licensing models in this world - like, do you pay for virtual machines which are suspended to disk and not even running? That's something which the owners of those bits of proprietary software have got to figure out: what their licensing model is going to be in a virtual world. And some of them are more enlightened than others. Interestingly, Microsoft are one of the more enlightened in this respect, in that they allow you some number of virtual machines on each physical node.

So today it's command line tools which you use to add and remove memory, or there is a web UI for doing that, but it's not particularly sophisticated. What is happening is that there are lots of companies building management tools on top of Xen - XenSource and various other companies too - and there are various open source efforts as well, and they provide more fine-grained control over memory management.

Yeah, it's all open source. What we're actually doing, rather than running QEMU as a user space process, is running it in what we call a mini guest - we're sort of porting it natively to run on top of Xen. And we'll probably also change some of the device models, so that rather than emulating an old PCnet32 device, we'll emulate devices which have checksum offload, jumbo frames and all of these more modern features. You also have to be very careful about what device you pick: you want to pick a device which supports DMA and stuff like that, but where the device driver does very few operations that cause exits back to the hypervisor. If you pick the device you emulate carefully, I think we can get very good performance.