at the last session of the evening, this will be a full 25 minute session, so it's gonna go a little past 1900 hours. I'm pleased to introduce Roger Pau from Citrix, who will be talking about PVH in the Xen Project. Roger?

Hi, my name is Roger. I work for Citrix on the open source Xen Project. I'm also a FreeBSD developer, and I do a little bit of Linux and QEMU work when needed, but recently my main focus has been Xen, and I've been working on something that we call PVH, which is a new virtualization mode for Xen that we plan to use for both guests and hosts. I'm here today to introduce the differences between PV and PVH, and how that's going to change the Xen ecosystem.

I usually start with a brief description of a typical Xen system. Here we have the hardware, which contains the CPU, the memory management unit, all your PCI devices; everything is right there. Then we have Xen on top of that hardware. Xen basically takes control of the CPU, the MMU, some timers, the local APICs, and that's mostly all. It doesn't have drivers for PCI devices, it doesn't have drivers for anything else, basically; for your disks, it doesn't have any of those drivers. All those drivers are inside of what we call the control domain. That's the first guest launched by Xen, and it can be Linux, FreeBSD or NetBSD, basically. Usually inside of that domain we have all the drivers for the different devices on the system: the PCI devices, the network cards, the disks; everything is usually inside of the control domain. On top of the control domain I've put some tasks that you usually have running there, like syslog, an xterm, Xorg, whatever you want. And here I've added two guests as well. As we can see in this picture, the guests inside of Xen are on par with the control domain; they more or less share the same interface, and even the control domain is just a guest from Xen's point of view.

Here, on the other hand, I have a description of what would usually be a type 2 hypervisor; that would be KVM, bhyve, VirtualBox. All those type 2 hypervisors share more or less the same design: you have the hardware, and on top of the hardware you have your operating system, either Linux or Windows or macOS or whatever. Then there's a small module inside of your operating system; that's the hypervisor, which takes care of driving the virtualization functions of the CPU. And here, for example, we can see that we have some tasks on what would be your host operating system; I've placed the same tasks, basically, and we also have some guests. I would like to note the difference from the picture before, because here we can see that the guests share the same scheduler with the applications that are running on the host OS. That's not something that happens with type 1 hypervisors, because there the hypervisor is completely isolated and its scheduler is only designed to run VMs, not tasks. So here the guests actually compete for resources with the tasks that you have running on your host.

So now I would like to speak a little bit about the current dom0 interface and the limitations that we have with it. This interface was designed a very long time ago, in the 90s, so it's a little bit different from how we would do it now.
We've been using that interface for a long time, but we've reached a point where I think we need to change and improve the interface that we provide to dom0 in order to make things easier. One of the key differences of the Xen interface is that when Xen was designed there were no hardware virtualization extensions, which means that you could not virtualize the CPU; you didn't have any support from the CPU in order to do virtualization. That means that when Xen takes over the memory management unit, we have to provide a different interface for OSes to interact with that memory management unit. So that's one thing that's different on Xen, and it's very intrusive, because every architecture has only one interface to the memory management unit, and Xen was basically introducing a second interface to the memory management unit on x86. That's something that doesn't happen anywhere else, so it was very intrusive in terms of the modifications that you need to perform to an operating system that wants to run on top of Xen. The CPU handling is also done completely differently. The setup and delivery of interrupts is also very, very different from bare metal. That's mainly because for PV guests, in the current dom0 interface, we don't provide a local APIC to dom0, so basically we have to use another way to deliver interrupts. And finally, the ACPI tables are also quite different. Well, not really different, but the tables that we provide to dom0 are not very good in regard to the actual description of the system. I will go a little bit into this.

So yeah, as I said before, the MMU is different for PV guests, because we have to provide a set of hypercalls that the guest can use in order to interact with the memory management unit, and that's different from what's done on bare metal, where you just have certain instructions that are used to interact with the memory management unit. This code, as I said, is very intrusive, because you have to modify core parts of the OS in order to introduce all this Xen-specific code. It's also limited to 4 KB pages only; you cannot use 2 MB pages or 1 GB pages, so that's quite a problem for performance, especially now that we have systems with a lot of memory. And it involves using hypercalls in order to set up your page tables. That means every time you want to fork a task or something like that, you have to issue hypercalls to the hypervisor, and the hypervisor has to create the page tables for you; it's very intrusive. Finally, PV guests cannot use what are called privileged instructions, so they have to resort to the hypervisor to execute them on behalf of the guest. This also involves using hypercalls.
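To make that concrete, here is a rough sketch of the difference for a single page-table update, using the mmu_update hypercall from Xen's public interface. The wrapper and header names below follow Linux's conventions, and the surrounding code is illustrative rather than taken from any real kernel:

    /* Native: a page-table entry is ordinary memory; one store updates it. */
    *pte = new_val;

    /* PV: page tables are read-only to the guest, so every update is a
     * request to the hypervisor. */
    #include <stdint.h>
    #include <xen/interface/xen.h>      /* struct mmu_update, DOMID_SELF */
    #include <asm/xen/hypercall.h>      /* HYPERVISOR_mmu_update()       */

    static int pv_set_pte(uint64_t pte_machine_addr, uint64_t new_val)
    {
        struct mmu_update req;
        int done;

        /* The low bits of .ptr select the request type. */
        req.ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE;
        req.val = new_val;          /* the PTE value we want written */

        /* One request, applied to our own domain. */
        return HYPERVISOR_mmu_update(&req, 1, &done, DOMID_SELF);
    }

In practice guests batch many of these requests into a single hypercall to amortize the cost, but even batched, this is a long way from a single mov instruction.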
Then, related to CPU handling, there are quite a lot of things that are different comparing Xen to bare metal. On bare metal, boot-time CPU discovery is done using an ACPI table called the MADT. For PV guests it's done using hypercalls, which means that you have to modify the very early boot code of an OS to use hypercalls instead of ACPI in order to discover the CPUs. Again, this is very intrusive. Also, the bring-up of secondary CPUs on native is done using the local APIC: you send a set of IPIs in order to wake up the secondary CPUs. On PV it's done using hypercalls. And the hotplug of CPUs on native is done using what's called the general purpose event block, which comes from ACPI, together with another part of ACPI, the processor objects: you basically receive an event from ACPI and then you scan your processor objects and detect that new objects are online. Again, on Xen this is done completely differently, and we use something called xenstore, which is like a database shared between the guests and Xen, used to pass information between the hypervisor and the guests, and between the different guests running on the same system.

The setup and delivery of interrupts is also quite different from native. On native, you basically receive all the interrupts from the local APIC, and the local APIC injects those interrupts into the CPU. There are mainly two different kinds of interrupts on x86. The first are the legacy PCI interrupts; those are implemented as sideband signals that go into the IO APIC, and the IO APIC injects them into the local APIC. We also have a newer kind of interrupt, called MSI or MSI-X, which is implemented using in-band signals and delivered directly to the local APIC. This is done by programming a certain address into your PCI device; the PCI device writes to that address when it has to trigger an interrupt, the write is trapped by the local APIC, and the local APIC injects the interrupt into the CPU. The configuration of interrupts on PCI systems is done through the PCI configuration space, which is a set of IO ports, or a memory area, that you use in order to interact with your devices.

On PV this is quite different, because, as I said before, PV guests don't have any kind of APIC, so we cannot inject interrupts using an APIC at all; we have to inject interrupts into the guest using another mechanism called event channels. Event channels are something specific to Xen; they are the para-virtualized interface used by Xen to inject events into the guest. This again implies modifying quite a lot of code in guest OSes in order to implement this new interrupt interface that's only used by Xen. It also creates a lot of maintainership burden inside of OSes, because you have to introduce a lot of code and you also have to maintain it, and to be honest, this code is quite critical; I mean, the interrupt paths are not something that you really want to be modifying in any OS. And finally, as we don't have any emulated PCI configuration space for PV guests, we also have to set up interrupts using hypercalls, which means that this whole interface is very different from native, and it's not a trivial amount of code. So here I have a picture of what an interrupt injection into a PV guest looks like: your physical device injects an interrupt into the physical APIC, which is controlled by Xen; Xen receives this interrupt and injects it into the guest using the event channels, and the guest finally receives the interrupt.
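To give a flavor of what that PV interrupt path requires from the guest: instead of programming the IO APIC, dom0 asks Xen to bind a physical IRQ to an event channel and then waits for events on that port. A simplified sketch against the public event-channel interface, with error handling and the event upcall plumbing omitted:

    #include <stdint.h>
    #include <xen/interface/event_channel.h>  /* EVTCHNOP_bind_pirq etc. */

    /* Ask Xen to route physical IRQ `pirq` to an event channel.  From
     * then on the interrupt arrives as an event on the returned port,
     * not through a (non-existent) local APIC. */
    static int bind_pirq_to_evtchn(uint32_t pirq, evtchn_port_t *port)
    {
        struct evtchn_bind_pirq bind = {
            .pirq  = pirq,
            .flags = BIND_PIRQ__WILL_SHARE,   /* the line may be shared */
        };
        int rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);

        if (rc == 0)
            *port = bind.port;                /* OUT field filled in by Xen */
        return rc;
    }

Every OS that wants to run as a PV guest has to carry this kind of code, plus the event upcall handler it feeds into, which is exactly the maintainership burden being described.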
Finally, one of the things that's also different on a PV dom0 compared to bare metal is the ACPI tables, and I would like to say a little bit about ACPI tables first. There are mainly two different kinds of ACPI tables. One kind is the static tables, which are used at boot; they are very simple tables in memory that can be mapped to a C structure, so they don't contain anything very complicated. It's just static information that you can use during boot. They are very easy to parse, very easy to modify, and they don't contain anything weird in general, but these tables are only meant to provide very early boot information. Most of the real information on an ACPI system is provided through what are called dynamic tables. These tables are written in a language called ACPI Machine Language (AML), and this language requires you to have a parser inside of the kernel in order to make sense of the tables. In these tables you also find static information about devices, but they additionally contain methods that you can actually execute, so they are a little bit more complex. The dynamic tables are, for example, where you find the information about all your PCI devices, so most of the information on the system actually comes from these tables.

One of the problems is that on a traditional PV dom0, all these tables are passed as-is to the guest, which means that the information received by dom0 is not the information it should have received. For example, you can limit dom0 to two CPUs, but if it looks at the ACPI tables and the hardware has 16 CPUs, it will think that it has 16 CPUs, because we simply don't fix that at all; we pass the tables as-is to dom0. Another problem with the ACPI tables is that Xen can only fetch information from the static tables, because Xen doesn't have an AML parser. AML parsers are big and require quite a lot of code to implement, so Xen never had one; I don't know, maybe that's going to change, but at the moment it doesn't have an AML parser, so it can only fetch the information from the static tables. But there's some information that Xen requires in order to run that lives inside the dynamic tables: the hotplug of physical CPUs, the CPU C-states and the sleep states are all in the dynamic tables, and Xen needs those in order to work properly. So there's an interface for dom0 to pass that information to Xen: basically, dom0 has to parse the dynamic tables, extract that information, and hand it to Xen. This is also quite costly, because it involves modifying native drivers in order to pass all this information to Xen. It would be possible for Xen to fetch most of this information itself, but there's one limitation in ACPI: only one operating system can execute methods, so if Xen executes any ACPI methods, it has to execute all of them, and we don't really want to do that. So Xen could probably fetch most of this information, but it will still need some help from dom0 in order to execute ACPI methods. So yeah, these are more or less the current limitations of a PV dom0, or the more important ones.
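Going back to the static versus dynamic distinction for a moment, this is why the static tables are the easy case: a table like the MADT is just a header plus a list of fixed-size records that you can walk with plain C structs and pointer arithmetic, no interpreter needed. A minimal sketch; the field layout follows the ACPI spec, but the struct and function names here are mine:

    #include <stdint.h>

    struct acpi_header {                /* common 36-byte ACPI header */
        char     signature[4];          /* "APIC" for the MADT        */
        uint32_t length;                /* total table size in bytes  */
        uint8_t  revision, checksum;
        char     oem_id[6], oem_table_id[8];
        uint32_t oem_revision, creator_id, creator_revision;
    } __attribute__((packed));

    struct madt_entry {
        uint8_t type;                   /* 0 = processor local APIC   */
        uint8_t length;
    } __attribute__((packed));

    struct madt_local_apic {
        struct madt_entry h;
        uint8_t  acpi_processor_id;
        uint8_t  apic_id;
        uint32_t flags;                 /* bit 0: CPU enabled         */
    } __attribute__((packed));

    /* Count the enabled CPUs announced in a memory-mapped MADT. */
    static unsigned count_cpus(const struct acpi_header *madt)
    {
        /* The "+ 8" skips the local APIC address and flags words that
         * sit between the header and the entries. */
        const uint8_t *p   = (const uint8_t *)madt + sizeof(*madt) + 8;
        const uint8_t *end = (const uint8_t *)madt + madt->length;
        unsigned cpus = 0;

        while (p + sizeof(struct madt_entry) <= end) {
            const struct madt_entry *e = (const void *)p;
            if (e->length == 0)
                break;                  /* malformed table, bail out  */
            if (e->type == 0 &&
                (((const struct madt_local_apic *)e)->flags & 1))
                cpus++;
            p += e->length;
        }
        return cpus;
    }

Compare that with the dynamic tables, where the equivalent information about, say, a hotplugged CPU is hidden behind AML methods that have to be interpreted.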
Now I would like to speak a little bit about how we are trying to move away from all this and provide a new interface to dom0. The main point of this interface is that we want it to be as close as possible to native. That's very important, because it means it's probably going to be faster, and it's also going to reduce a lot of the code that we need to put inside of OSes in order to run on top of Xen; if the interface is very close to native, the code required to run on Xen is going to be much smaller. We only want to use hypercalls when there's no other interface we can use. I mean, I'm sure that we'll have to use them from time to time; it's not something we can really get rid of, but we'll try to reduce as much as possible the number of hypercalls that we need to use from dom0. And finally, we would also like to take advantage of all the new hardware virtualization extensions that came up on recent, well, not that recent, Intel and AMD CPUs; they've been around for almost eight years now.

So, regarding the MMU, that's a very easy problem to solve, because newer CPUs have what are called the hardware virtualization extensions; on Intel that's VT-x and on AMD SVM, I think, or something like that. These virtualization extensions basically allow you to create what's called a second-stage translation, which means that we can present a physical memory map to the guest that contains memory from different physical regions: we create a page table, and the guest thinks that page table is its memory map. That makes it very easy for us to provide this transparently. Using these virtualization extensions we can also provide the guest with a virtual memory management unit that's emulated by the hardware, so we don't have to do anything there, and this also lets the guest use all the page sizes that are supported by the hardware; if the guest wants to use 1 GB pages, it can use them. Basically, we don't need to modify the guest in any way; it's just transparent from the guest's point of view.

Interrupt management is also quite important. One of the things we want to do with interrupt management is provide dom0 with an emulated local APIC and an emulated IO APIC. The local APIC can sometimes be provided by the hardware itself, because there is newer hardware capable of emulating a local APIC; I don't remember the name, but I know there's Intel hardware that has this, and there's also AMD hardware coming out later this year that will have this feature. The IO APIC will be emulated inside of Xen, and we will be using the same code that we already use for HVM guests, so it's not introducing new code into Xen; it's just using the code that we already have there. And finally, we would like the configuration of interrupts to be done using the PCI configuration space. That means we'll have to introduce some emulation code inside of Xen; it's not going to be a lot of code, but we'll have to introduce some traps for the PCI configuration space in order for Xen to detect that the guest is configuring interrupts and react properly.

Here we have another picture of what it will look like with an emulated local APIC: the interrupts from the physical devices will still usually be received by Xen on the physical APIC, and then Xen will inject them into the emulated local APIC of the guest. I've also added a straight arrow from the device to the guest, because if the hardware supports something called posted interrupts, it's possible for a device to inject an interrupt directly into a VM. That means the latency goes down quite a lot, because we don't have to go through Xen; the interrupt is injected directly into the VM.
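For reference, this is the kind of access that will be trapped: on native x86, the legacy PCI configuration space is reached through two I/O ports, 0xCF8 to select the bus/device/function/register and 0xCFC to move the data. A self-contained sketch; this is standard PC hardware, nothing Xen-specific, and the point is that with the new mode dom0 executes exactly this code while Xen intercepts the port accesses:

    #include <stdint.h>

    static inline void outl(uint16_t port, uint32_t val)
    {
        __asm__ __volatile__("outl %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint32_t inl(uint16_t port)
    {
        uint32_t val;
        __asm__ __volatile__("inl %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }

    #define PCI_CONF_ADDR 0xCF8
    #define PCI_CONF_DATA 0xCFC

    /* Read a 32-bit PCI configuration register.  Writes to a device's
     * MSI capability go through this same window, which is what Xen
     * traps in order to see dom0 configuring interrupts. */
    static uint32_t pci_conf_read32(uint8_t bus, uint8_t dev,
                                    uint8_t fn, uint8_t reg)
    {
        uint32_t addr = (1u << 31)             /* enable bit            */
                      | ((uint32_t)bus << 16)  /* bus number            */
                      | ((uint32_t)dev << 11)  /* device (slot), 0-31   */
                      | ((uint32_t)fn  << 8)   /* function, 0-7         */
                      | (reg & 0xFC);          /* dword-aligned offset  */
        outl(PCI_CONF_ADDR, addr);
        return inl(PCI_CONF_DATA);
    }

The point of trapping this interface rather than inventing a hypercall is the theme of the whole talk: dom0 keeps its unmodified native PCI code, and Xen does the extra work behind the scenes.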
Finally, the problem with the ACPI tables is a little bit more tricky to work around, but I think we've found a way to solve it. First of all, we will provide a new MADT to dom0 that actually reflects the topology of dom0; that means dom0 will see the number of CPUs that it can use, not the number of CPUs available on the whole hardware. Then we will also provide an extra dynamic table for dom0 that contains processor objects for those CPUs. This is needed in order to comply with the ACPI spec, because ACPI requires you to provide processor objects for the CPUs in the MADT. And finally, we will hide the native processor objects from dom0 using a table called the STAO, the Status Override table, which can be used to hide devices in the ACPI namespace. This table is actually under Xen's control, so we can modify the version of the table and add new fields if we need to.

So, as a final note, I think we can manage to greatly reduce the Xen-specific code inside of the several OSes. Linux especially has a lot of Xen code inside, and we would like to get rid of that, because it's quite hard to maintain, it's very different from bare metal, and it's usually a source of bugs in general: most of the x86 maintainers don't really understand the Xen code, so when they change the native code, they break Xen. We would like to get rid of that; it's easier for everyone. We'll have fewer bugs to fix in Xen, and it's going to be easier to maintain because we'll get rid of all of this code. We'll also be able to take advantage of all the hardware virtualization extensions on the market, which means we can take advantage of all the newer hardware. And we'll be able to simplify the dom0 interface a lot, which means we can expect maybe new OSes to add support for running on top of Xen, even as a dom0, because the interface is going to be very, very similar to bare metal. So we expect that maybe someone is going to implement new dom0s in the future. Yeah, and that's all. I would gladly take any questions now.

What are the plans for other architectures beyond Intel? Will these new extensions help something like ARM?

We already have support for ARM. Oh, sorry, the question was whether there are plans for new architectures in upstream Xen. We already have support for 32-bit and 64-bit ARM. Well, these new extensions are x86-specific, so they don't really apply to ARM. What I can tell you is that ARM already makes use of this approach, because the Xen on ARM port was started very late, so it already makes use of all the new hardware extensions that are present on ARM. Then regarding RISC-V and other architectures, I'm not aware of anyone working on that. If somebody contributes the code, I'm sure it will be very gladly received.

The question is whether you need a specific kernel for a para-virtualized virtual machine. The response is no: starting from Linux 3.0, all the Xen code is inside of Linux, so you can just build a normal Linux kernel and it will have all the Xen support. That was merged, I think, four or five years ago. So yeah, I think that's all. Thank you very much.