That was just a test. OK. So if you come back at the same time tomorrow, you get that one. All right, let's try that again. So today I'm doing the libvirt/KVM driver update. What's that going to entail? Primarily we're going to focus on the Kilo features, obviously. I'm going to do a little bit of an architecture refresher: a refresh of how the driver fits into OpenStack, how the driver works, and also how libvirt and KVM work as well. After that, I'll dive into some Kilo features, primarily around the performance tuning enhancements for virtual machine guests we've done in this release, and also some Liberty predictions slash speculations. Obviously, alongside the general summit this week, we have the design summit, which is the primary forum in which we define the agenda for the next release, so none of these things are really confirmed yet. Compute is one of the original OpenStack projects. It forms the basis of the Infrastructure-as-a-Service offering, along with storage and networking. I personally would argue that some of the other services we offer, things like Ceilometer for monitoring or Heat for orchestration, are really Infrastructure-as-a-Service-plus, which in turn are based on these components. Like most OpenStack projects, Compute has a number of back ends: not just libvirt back ends, but also things like vCenter, Hyper-V, and so on. In this particular case, we're talking about the libvirt driver, and specifically libvirt/KVM. libvirt itself actually supports a number of virtualization technologies, so the libvirt driver also supports a number of virtualization technologies through the same piece of code. Alongside KVM, we have support for things like Parallels, LXC containers, Xen, and so on. Nova itself is, I'd say, relatively technology agnostic, in terms of dealing with virtual machines, bare metal, even containers.
But the API, just because of the way it was envisaged originally around virtual machines, is a little bit virtual-machine specific. So there are some things that are very specific to virtual machines that don't necessarily apply to, for example, containers. When we look at the support matrix, and I'll talk a lot more about this in the broader Compute talk tomorrow, there's a list of things that a compute driver must support, and then there's a list of things that are optional. What I'm referring to there is that there are actions a compute driver doesn't necessarily have to support, because they may not make sense for the underlying technology. On this particular slide, if we break it down at a high level, we have the Nova API; we have an RPC message bus for communication between the services; we have the conductor for interacting with the database, which means we don't need to have the database credentials on every single compute node in the environment; and the scheduler, responsible for placing instances. Primarily today we're going to be focusing on that box on the right, the compute node itself. That is where our instances run. In the libvirt/KVM driver case, the nova-compute service that initiates the build of the instance is co-located on the same machine as the hypervisor, or rather as the QEMU/KVM processes. In some of the other drivers, like the vCenter driver, that's not necessarily the case: nova-compute may run in an ESXi VM, or on other hardware, and talk to vCenter separately. Libvirt/KVM, based on the latest survey results that came out this week, is used in 85% of production OpenStack deployments. That is slightly down from the 87% the last time we did the survey; if you look at the results, that's primarily down to an uptick in the usage of Ironic for bare metal management and also some of the container-related drivers.
So libvirt itself, and in fact the entire libvirt/QEMU/KVM stack, effectively provides a free and open source hypervisor. libvirt is itself an abstraction layer providing an API to talk to hypervisors and manage virtual machine life cycles. It supports many hypervisors and architectures. Today I'm implicitly talking primarily about libvirt/KVM on x86, which is what I'm most familiar with, but in the Kilo cycle, for example, support for KVM on s390 landed in the Nova driver as well, and there are plenty of other architectures supported too. QEMU is effectively the Swiss Army knife of virtualization: it's a machine emulator, able to use dynamic translation to run a virtual machine guest without any hardware assistance at all, but that's quite slow. So we primarily use KVM, the Kernel-based Virtual Machine, to provide hardware acceleration to those guests. I mentioned libvirt being an abstraction layer. You might think: why are we using libvirt instead of speaking straight to QEMU? Isn't Nova my abstraction layer? Well, when we look at this tiny, tiny print, this entire command is just the QEMU/KVM command line for booting a guest in OpenStack without any customization whatsoever. This is the out-of-the-box result, and that, as an API, doesn't work so great to talk to. In the libvirt project there's a lot of experience in dealing with that; it's been around for about 10 years, and a lot of stuff has been built into it. So in that previous command, when we look at all the configuration we're getting by default, it's configuring the CPUs, the NIC presentation to the guest, the disks, and the bus type for those disks, whether they're IDE, virtio, even iSCSI.
There's PCI device pass-through potentially, and even when we're not using PCI device pass-through, in the default case we still have to expose things like a virtual keyboard, so we can talk to the machine; consoles; the clock parameters; and so on. The libvirt project and its ecosystem also provide a number of tools that are very useful for working with these virtual machines. So there's virsh, a CLI for interacting with libvirt. In an OpenStack environment you won't generally use this to modify the guest, because Nova will just clobber those modifications, but it's useful for getting at the XML so you can see exactly what was created when you're troubleshooting, for example. There's virt-rescue, for running a rescue shell on a virtual machine; so if, on a Saturday when you're preparing for a presentation, you kill your guest in some unfixable way, you can get in there and sort that out. virt-sysprep is for creating templates, virt-v2v for converting from other environments, virt-sparsify for converting to thin provisioning. These are all useful tools that have existed for some time, and you can use them with OpenStack images intended for use with libvirt/KVM. I mentioned on my Nova architecture diagram that we'd be focusing on the box on the right, so here I'm minimizing the control plane and zooming in on that. On our hypervisor, or compute node, we have the nova-compute service or agent, which is responsible for communicating with libvirt. It does that using libvirt's exposed APIs. libvirt, in turn, launches a QEMU process for each guest. I'm being a little bit liberal here, because sometimes, when you're doing activities like snapshotting or migration, there can be additional QEMU processes related to that guest involved in doing that. But primarily I would think about it as at least one QEMU instance for each guest.
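Just to sketch how a couple of those tools get used in practice when troubleshooting an OpenStack guest (the domain name, image names, and disk path here are illustrative, not from the talk):

```shell
# List the libvirt domains Nova has created, then dump one guest's XML
virsh list --all
virsh dumpxml instance-0000001b

# Open a read-only rescue shell against a broken guest's disk image
virt-rescue --ro -a /var/lib/nova/instances/<uuid>/disk

# Convert an image to thin-provisioned (sparse) qcow2 form
virt-sparsify --convert qcow2 fedora21.img fedora21-sparse.qcow2
```

The `--ro` flag matters on a running guest: it lets you inspect the disk without risking corruption.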
And then those QEMU instances in user space are responsible for talking to KVM in the kernel to execute instructions on the actual hardware as needed. We also have, around the fringes of the driver, other useful pieces. There are virtio drivers, for providing para-virtualized device access to virtual machines, which gives a speed improvement. For example, the vast majority of users would use virtio-net instead of the default emulated device, and similarly virtio-blk instead of IDE, for performance reasons. Those drivers are included in the vast majority of Linux-based operating systems, if not all, and they're also available for Windows guests. The QEMU guest agent optionally runs inside guests and facilitates interaction, either by users or management platforms, so that they can effectively run something inside the guest. We'll talk a little bit in the Kilo features about an example of how that's used, but from a basic point of view it just provides an API for doing things that you need to get into the guest to do: extracting the IP address, for example, or even initiating snapshots. VENOM, kind of topical: for people who saw the security advisory last week around QEMU, as used in both KVM and Xen environments for that matter, there was mitigation of the VENOM vulnerability by using sVirt. sVirt is a framework for defining a policy for what the QEMU process is allowed to see; if it tries to access something that isn't on that list, it gets denied. Because it's named sVirt, and because it was originally a Red Hat project, I think people sometimes assume that it's tied to SELinux in a way that means you can only use it on RHEL or Fedora. It was actually always designed with the idea of multiple security back ends in mind, so it also works with AppArmor. Most of the time, your distribution should enable this by default.
If you're disabling it for some reason, probably don't do that. The VIF drivers: this is getting more into the Nova-specific side of the driver architecture. There's a concept of a virtual interface (VIF) driver, which defines how we plug and unplug NIC devices from the guest. What that means in practical terms is that each different interface type in libvirt has a slightly different XML incantation to attach it. You don't usually need to think about this too much from a user point of view these days, because there's a generic libvirt VIF driver responsible for plugging all of the most common device types. It's no longer easily pluggable by out-of-tree implementations, which is a bit of a sticking point, but there's something being looked at for Liberty that might ease some of the pain around that without making it fully pluggable. Just by way of example, when we look at the difference between a pass-through device and a vhost-user device, as generated by the VIF driver (the vif.py file), we can see that the interface type is different: direct versus vhostuser. They're both using the same MAC address, and the model type is virtio in both cases, but the source device is different: in the first example we're using the physical eth0, while in the second we're actually using a Unix socket. Volume drivers are conceptually similar, except there's no generic driver, so there is a very, very long list of them; in this particular example I've just put in a few. The same idea basically applies: the driver tells libvirt what it needs to put in the XML so that you can attach that type of volume. Moving into the Kilo features section: as I mentioned, I'll be focusing primarily on performance features. Some of this work actually started in Juno, so for some things you'll see on the list, NUMA-aware scheduling for example, you might think: wasn't that in Juno? Well, yes.
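A rough sketch of the two interface stanzas being contrasted above, as they might appear in the guest XML (the MAC address and paths are made up for illustration):

```xml
<!-- direct (macvtap) pass-through of a physical NIC -->
<interface type='direct'>
  <mac address='fa:16:3e:aa:bb:cc'/>
  <source dev='eth0' mode='passthrough'/>
  <model type='virtio'/>
</interface>

<!-- vhost-user: same MAC and model, but the source is a Unix socket -->
<interface type='vhostuser'>
  <mac address='fa:16:3e:aa:bb:cc'/>
  <source type='unix' path='/var/run/openvswitch/vhu-port0' mode='client'/>
  <model type='virtio'/>
</interface>
```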
But I would position the Kilo versions as the first usable baseline, where you can combine all these things together and have them play relatively nicely. So we have the ability to pin virtual CPU cores to physical CPU cores, albeit implicitly, and I'll talk about that a little bit in a moment; the ability to back guest memory with huge pages; and NUMA-aware scheduling, both the extension to cover memory binding and I/O device locality awareness, and, to be honest, just general cleanup of a number of bugs in that area. When we talk about CPU pinning: in Juno, a NUMATopologyFilter was added to the scheduler, giving it the ability to take the NUMA topology of the host into account when scheduling. This kind of needs to be done in two places, actually: we need to be aware of the NUMA topology of the hosts when scheduling, and which cores in which nodes are available, but also on the agent we need that same information to build the guest in the correct way on the other side. The NUMATopologyFilter was extended to add the concept of a dedicated-resource guest. Unlike perhaps a typical virtualization environment, say VMware or Red Hat Enterprise Virtualization, we don't explicitly, from a user perspective, say "this is the exact CPU core I want, pin my vCPU to that", because in a cloud environment we're not supposed to have awareness of exactly how the hardware looks. So to simplify and abstract that a little, the concept is applied in a way where you say in your guest request "I want this guest to have dedicated resources", and then the scheduler and the compute agent are responsible for finding available cores to pin that guest to. You are trading off, when you do this, the ability to overcommit memory and CPU. The NUMATopologyFilter and the agent code do that implicitly for you, but you need to be aware of it.
So there's a tradeoff to be made, in that you're getting dedicated cores and dedicated memory, but as a result your consolidation ratios aren't as good. And, just to mention at the end there, and I'll go through this in an example, you also need to combine these configuration options with existing techniques for isolating cores. For example, if I have four CPU cores in my host and I pin guests to all of them, my host processes and my vSwitch are still bouncing around somewhere inside that node, so you're not getting the deterministic performance you're probably after unless you also pin the host processes somewhere. Looking at an example, a very basic NUMA hardware layout: I'm just using numactl here to dump what it's seeing. On the left you'll see node 0 and node 1, the two NUMA cells, or NUMA nodes, available. NUMA node 0 has CPUs 0, 1, 2, and 3, so four CPU cores; NUMA node 1 has 4, 5, 6, and 7. In each case they have around a gig of memory available. On some newer systems you can also get an I/O layout associated with this, so PCI devices, or PCIe lanes, can be associated with a specific NUMA node as well. And you can see at the bottom it also shows me the node distances. We see there that access from node 0 to node 0 is obviously closer than accessing node 1 from node 0. That applies mainly to your memory in this particular case, but also to the I/O devices on chipsets where that's supported. Visualizing that a little: this is just a diagrammatic representation of the same thing. And to explain what I mean by access being slower: if I'm on core 0 in this example, and I need to access memory that's in node 1's memory banks, I have to go through the interconnect bus, and that's going to add a performance penalty versus accessing memory local to the node I'm on.
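For reference, the kind of numactl output being described looks roughly like this on a two-node box (the figures are illustrative, not the exact ones from the slide):

```
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1024 MB
node 1 cpus: 4 5 6 7
node 1 size: 1024 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
```

The distance table is the part described at the bottom of the slide: node-local access (10) is cheaper than crossing the interconnect (21).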
I mentioned earlier that virsh allows you to dump the XML, not just for the guest, actually, but also for the host, so libvirt's representation of what it's seeing. This particular case is just an excerpt: we see here that I'm looking at cell ID 0. We see my 8 GB of memory exposed in kilobytes here. It also shows the pages elements: all of my pages in this particular example are 4K; I have no huge pages, which is the next line down (the 2048 entry is 2 MB huge pages, and I have none of them in this example). And then at the bottom I have my series of CPUs, starting with CPU ID 0, and the alignment they have with specific NUMA nodes. So I'm going to walk through a little example of configuring this. In this example, I'm going to try to set it up with the hardware layout I just showed, and set up CPU pinning based on that. First of all, I have to enable the NUMATopologyFilter, and typically the AggregateInstanceExtraSpecsFilter as well. The NUMATopologyFilter is obviously taking account of my host topology. The aggregate filter is used because I want to segregate the machines on which I run guests that need dedicated resourcing from the rest of them; on the ones I want dedicated resourcing on, I need to take some additional configuration steps to make sure they're ready for that. In this particular example, I'm going to reserve CPU cores 2, 3, 6, and 7 for the purpose of running guests. What that means is that I'm reserving two cores out of four on each of my nodes, keeping in mind that this is a fairly contrived example. But stay with me. What the isolcpus kernel parameter is doing is telling the host kernel: do not schedule processes to these cores. Now, it will still allow you to schedule processes to those cores if you explicitly pin them, which is, of course, what we're going to do.
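As a sketch, the two pieces of configuration described above would look something like this (the filter names are the standard Nova ones; the core list matches this example, and the other default filters are elided):

```
# Kernel command line on the dedicated compute nodes: keep the host
# scheduler off cores 2, 3, 6, and 7
isolcpus=2,3,6,7

# nova.conf on the scheduler node (excerpt)
scheduler_default_filters = ...,NUMATopologyFilter,AggregateInstanceExtraSpecsFilter
```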
Using the vcpu_pin_set variable in nova.conf, we're going to express basically the same thing. This is telling libvirt and QEMU, and specifically the libvirt driver: this is where I want you to run the virtual machines. Even without any further configuration, if we're not doing CPU pinning, it will put pinning ranges into the guest XML to ensure that your virtual machines run on the cores specified in this variable. And you can specify a range; for example, if I were giving the entire first NUMA node over to running guests, I would specify the range 0-3. And this is a visual representation of what I'm trying to do. It's kind of the reverse: I specify that I want 2, 3, 6, and 7 to be isolated from the kernel scheduler, so implicitly the rest of the cores, 0, 1, 4, and 5, are being used for host processes. Realistically, in an environment that cares about this kind of thing, like a network functions virtualization or packet processing use case, I would pin my vSwitch to one of those cores, for example. In terms of setting up the guest, we have an opportunity to specify whether we want the instance to be dedicated. In this case I'm going to use the flavor extra specifications; you can actually also do it from an image property, if I recall correctly. In this particular case I've pre-created a flavor called m1.small.performance; it's basically just a copy of m1.small, but I'm adding this configuration key: hw:cpu_policy=dedicated. And then on the nova boot command I simply specify my image, as usual, and this new flavor I've created, to create my instance. So we take a look at the resulting libvirt XML for the guest. The vCPU placement is static, but it would have been before, actually, as well.
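Putting the steps above together, the configuration and commands would look roughly like this (the flavor sizing, image name, and instance name are illustrative):

```
# nova.conf on the compute node: only run guests on the isolated cores
vcpu_pin_set = 2,3,6,7

# Create the dedicated-resource flavor and boot with it
nova flavor-create m1.small.performance auto 2048 20 2
nova flavor-key m1.small.performance set hw:cpu_policy=dedicated
nova boot --image fedora21 --flavor m1.small.performance pinned-guest
```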
But what it does give me in addition is a one-to-one relationship between the vCPUs of my guest and the physical CPU cores on the host. You can see here that virtual CPU 0 is explicitly pinned to CPU core 2 on the host, and similarly virtual CPU 1 is pinned to core 3. We have also, as you can see in this example, pinned what's referred to as the emulator thread; this is an additional QEMU thread associated with a guest, and in the current implementation it's pinned to the union of the allocated CPUs, so the emulator thread will still move around between those two cores. That's something we may look at tuning further, providing other options for in the future, but that's the default implementation for now. When we go further down in the output, we'll also find that the memory is strictly aligned to the NUMA node. This is a relatively new ability, not just in QEMU but also in the kernel, and if I recall correctly it only works when you're in KVM mode: you can't use this with QEMU emulation, you have to have the hardware acceleration. But you can see here that the memory backing for this particular guest is pinned to node 0, or cell 0, as well. Moving on, and now combining this with huge pages. In that previous example, my host didn't have any huge pages defined yet. Huge pages allow us to use larger page sizes for our memory, which improves cache efficiency: instead of requesting 4K pages over and over again, obviously if I can request a one gig page once, I'm going to have faster lookups. But there are pros and cons to that as well. On x86 we're primarily dealing with 2 MB and 1 GB huge pages. The other thing I should mention is that these can't be overcommitted, so if I'm using huge pages, then again my memory overcommit is gone. And different workloads are going to exhibit different characteristics when using huge pages.
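Going back to the pinning XML just described, the relevant guest XML excerpts look roughly like this (a sketch; the values match this example, and exact attributes depend on the host):

```xml
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <emulatorpin cpuset='2-3'/>  <!-- pinned to the union of the guest's cores -->
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>  <!-- memory strictly from NUMA node 0 -->
</numatune>
```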
There are workloads that benefit from 2 MB huge pages but may not benefit from 1 GB huge pages, for example. It really takes some profiling to work out exactly what the right fit is; bigger is not always better. In terms of the process for setting this up, an administrator uses the normal Linux tools for reserving huge pages during compute node setup, and then creates some flavors to match. Again, you can also do some of this from image properties. Taking a look at an example here: as I mentioned, I have to allocate the huge pages I want up front, on kernel boot in this particular case. I'm requesting 2048 two-megabyte huge pages, which is, of course, enough to back a four gig guest, which is what I'm going to create for this example. Again, this is a relatively contrived example for the purposes of this presentation; there's obviously a lot more involved, and there are some very exotic hardware configurations once you get into the newer stuff, depending on what chipsets are in use and so on. So I use grubby to set that up, update the boot record, reboot my machine, and then take a look and see where my huge pages are at. By default, when you specify the huge pages the way I did, the system will simply split them equally over the NUMA nodes you have. In this particular case I have two NUMA nodes, so I get 1,024 of those huge pages on each side. Similarly, we can see when I run virsh capabilities that the allocation for the NUMA nodes has changed: I now have 1,024 on each node. So I'm re-using my m1.small.performance flavor, and adding the key specifying that I want a memory page size of 2048. You can also, obviously, request the 1 GB huge pages, or there are shortcuts where you can specify small or large; but for the purposes of this architecture it's basically the same thing, so it doesn't make a lot of difference. And then I boot my guest using that flavor.
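A sketch of the huge page reservation and flavor setup just described (the grubby invocation and sysfs path are standard Linux mechanisms; the flavor name matches the earlier example):

```
# Reserve 2048 x 2 MB huge pages on the kernel command line, then reboot
grubby --update-kernel=ALL --args="hugepagesz=2M hugepages=2048"

# After reboot: check how the pages were split across the NUMA nodes
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages

# Ask for 2 MB backing pages on the flavor
nova flavor-key m1.small.performance set hw:mem_page_size=2048
```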
And when we dump the XML, we can see that the memory backing has been changed to use those huge pages on node zero, or, to be more accurate, on node zero where possible. Moving on to the PCIe example: we're using the same host here. More modern x86 chipsets have their PCIe lanes associated with a given NUMA node, and it's basically the same principle as with the memory: if I'm on core zero, and I'm using a physically passed-through device, and that device is attached to node one, then I'm not going to get the same performance I would if it were attached to node zero. So again, some extensions to the NUMATopologyFilter allow it to make use of this information where it's available; for chipsets that don't expose it, it makes no difference. So that was kind of the wrap on the performance features. I have also started working on a series of blog posts covering those on a more expanded basis. Now I'm moving on to some more general features in the driver for this Kilo release. libvirt 1.2.5 and greater supports the ability to use something that was recently added to the QEMU guest agent; that's, as a reminder, the agent that can run inside the guest and perform actions on that side. It's what's called a freeze/thaw API. What that means is that I can tell the guest I want to freeze, or quiesce, the file system, so that I can take a snapshot and ensure that snapshot is consistent; then, when I'm done snapshotting, I can thaw it again to allow the guest to start modifying the disk again. So this is a very useful ability for ensuring consistent snapshots. As for the way to enable it: when we have the guest agent inside a guest image, we need to set the hw_qemu_guest_agent image property to yes, so that Nova knows it's there.
Because we don't really have a lot of introspection inside the guest at this point, other than what we get via these agents. And for the guests where we want to use this capability, we need to set the os_require_quiesce image property to yes. If these things are set up, then when snapshotting, this will be implicitly taken care of for you; there's nothing to do after that point. It's primarily a guest creation and setup thing, done when you're creating the image. The next one was actually not a blueprint, but was treated as a bug fix. Hyper-V supports a number of additional para-virtualization features, particularly for Windows: it provides an enhanced experience for Windows guests running on top of Hyper-V. Linux guests actually also support those these days; there are contributions in the kernel that allow them to take advantage of these features when they're running on Windows hypervisors. On libvirt/KVM on Linux, we have the ability to emulate a number of these features. In particular, and the reason this was treated as a bug fix, is that certain Windows versions, when they're running on heavily loaded hosts, can encounter a blue screen of death if these particular features aren't present or exposed to the guest. So to take advantage of those, we expanded the behavior of the existing os_type=windows image property. There's an image property you can set which says this guest is of type Windows, and then some customization of the XML, and of the virtual hardware presented to that guest, is done based on that. Support for vhost-user: vhost-user is a new type of network interface implemented in QEMU and libvirt. It's intended to provide a more efficient path between a guest and user-space switches. It becomes particularly interesting, not just in normal OVS use cases, but as we look to things like OVS with DPDK acceleration combined together. So the VIF driver is there at the moment.
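Circling back to the snapshot-quiescing setup described a moment ago, the image properties would be set roughly like this (the image name is illustrative):

```
# Tell Nova the image carries the QEMU guest agent, and that snapshots
# of it must freeze/thaw the filesystem for consistency
glance image-update \
  --property hw_qemu_guest_agent=yes \
  --property os_require_quiesce=yes \
  fedora21-with-agent
```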
And I'll talk a little bit more, in the Liberty look-forward, about what more we need to do around that. So, Liberty predictions. I mentioned at the start, and I'll highlight again, that this is a fairly critical time in the cycle for determining what we're actually going to achieve in the next six months. These are all in varying states of discussion and review, and this is just my read on the stuff that looks likely to happen; again, it's primarily me speculating. Libvirt hardware policy from libosinfo: we touched on the os_type image property for telling Nova what type of guest is running. We said, for that particular Hyper-V enlightenment example, that we set os_type=windows. The obvious question when you think about that is: does that mean Windows 3.1? Does that mean Windows XP? Windows ME, probably not, and so on. And not just in the Windows world: some versions of BSD and other guests are also very sensitive to some of the timer information. So we want to get a little more granularity in what we're doing there, so that we're more explicit about exactly which guests we're talking about. Obviously, when we look at something like XP versus the latest versions of Windows, the workarounds we had to apply for XP are potentially quite different from the ones we need to apply for newer versions. libosinfo provides a kind of agreed-upon format for expressing this information. The idea is first to expand the use of the os_type image property so that it can at least take and use this format. Ideally, what we'd like to get to, probably beyond the Liberty cycle to be honest, would be introspection: actually being able to determine the guest type without the user having to explicitly tell us. But that's an extrapolation of what's proposed currently. Post-plug VIF scripts.
I mentioned that the VIF driver interface is not pluggable, because it's not really a stable API to begin with. This causes some issues, because it means that if a networking vendor wants an out-of-tree VIF driver, they have to land it in the Nova tree before anyone can really use it, or they have to apply patches to people's distributions, and so on. One of the main reasons they're typically doing this is not because they're not using one of the already defined interface types, but because they have to do some additional customization, or calls to their vSwitch, or setting up flows, or that kind of thing, after plugging the interface. So to get around that, or to help them find a solution that's more in the middle, what's proposed at the moment, and what's been discussed a little and is still under review, is the idea of allowing a post-plug script, so that people could use, for example, the generic vhost-user VIF driver, but then run some additional logic, plugged in externally, to do whatever they need to do on their vSwitch. There's a lot of discussion around further work on SR-IOV device pass-through. There are a number of other blueprints related to this, but the primary two that seem to have the most interest at the moment are the ability to attach and detach pass-through interfaces on a running guest (at the moment you have to attach the device, or the port, at the time you're booting the guest; you can't change it afterwards), and also the ability to do live migration of guests with SR-IOV devices. There's a bit of a catch there, in that live migration of those guests is only going to work where we're using the macvtap pass-through mode, where we're effectively creating a virtual device in between. It won't work when we're doing direct pass-through, and macvtap carries a bit of a performance penalty. So it's kind of a trade-off.
You can have maximum performance, but you're trading off live migration, and that applies to a few other things in Nova as well. Next, the ability to be more explicit about what CPU model and/or features a guest wants. There is actually a scheduler filter at the moment that can be used to say: this guest needs to run on a host with these CPU features. The problem is that it doesn't actually check what CPU model that particular node is exposing to the guest. So, all right, I landed on a host that has AVX or whatever feature it is I wanted, but I may not be getting that exposed in my guest anyway. There are a couple of proposals, and a discussion, around how to resolve that issue in a way that gives us what we actually intended: the guest not only lands on a CPU that has the feature, but on a host that is exposing that feature to virtual machines. Virtual machine HA I put on here because there's a lot of interest in it. It's not actually primarily an inside-Nova thing to implement, but there is some discussion about having the ability for external tools like Pacemaker or Keepalived, or other high-availability tools that are already well established as ways to detect failure, to tell Nova "hey, this failed", and what they're doing in reaction to that. The bulk of the HA work is actually in Pacemaker and Keepalived and other high-availability solutions, and in the way people are deploying those, but there is also this proposal for an API call in Nova for them to tell it what's happening. There are a couple of little tweaks around virtio performance enhancements under discussion.
One of the challenges with those is how to abstract them enough that you're giving a good experience to the generic cloud use case, while still giving enough customizability to people really pushing the boundaries. It's very challenging with some of these because, for example, virtio multiqueue is a way to enhance performance by increasing the number of queues associated with a virtio device for processing traffic, and that is very specific to the virtio driver; as a result, in a cloud model, it's difficult to abstract it enough. I should have mentioned that as I go down this list, I get to the ones I have less confidence in, just from my own take. So hot resize I put up here, and this is not the first time it's been proposed. This is the idea that most modern hypervisors, if not all, have the ability these days to add and remove CPUs and RAM on the fly, at run time, and the idea is to implement an equivalent of the resize function, or cold resize, that works on the fly as well in Nova. Whether or not that happens, we'll see. So that's the end of what I have for today. Again, if people are interested in breaking up, or blowing up, Compute itself, and taking a look at the wider architecture, I'll be back here at the same time tomorrow doing that. I'll also take any questions people have.