Hello from a very wet, very miserable part of the world, the west coast of Ireland in late September, early October. My name is Stephen Finucane. I'm a senior software engineer at Red Hat, and for the past six years or so I have been working on the OpenStack Compute project. The topic of my talk today is the magic of the libvirt driver.

Over the course of this talk, I'm hoping to give a very high-level overview of what the libvirt driver is, where it fits into OpenStack, and what the implications of using the libvirt driver are, particularly from the scheduling perspective. The libvirt driver has been around for maybe eight years or so now, and at this point it is by far the most widely used driver that Nova offers. But because of the sheer amount of complexity and features it offers, it can be a tricky beast to operate effectively. So I'm hoping to look at some of the issues you would see when running a deployment with the typical configuration, i.e. the libvirt driver, and how you can go about working around or resolving those issues.

Before we get into any of that deep-dive stuff, I think it would benefit us to have a quick recap of what OpenStack Nova is, what the architecture of Nova looks like, and where the libvirt driver fits into that. Nova is the Compute project. It's responsible for managing the lifecycle of VMs, or instances in OpenStack terms. If you've ever read the Nova documentation, you'll probably have come across an image just like this one on the screen. There are four main components within Nova, and there are a couple of external components that Nova relies on to handle things like images, authentication, and so forth. Focusing on those four components, they are the Nova API service, the scheduler service, the conductor service, and finally the compute service. The libvirt driver, as we're going to see, is a sub-component of the last of these, the compute service, but its usage has implications for the API and the scheduler services, as well as the conductor service in a manner.

Where libvirt, or any driver for that matter, actually fits into this diagram is underneath the Nova compute service. The Nova compute service has a manager, the compute manager, and that is communicated with via an RPC API from the Nova conductor service and, in some cases, other Nova compute services. That compute manager talks to the virt driver, which is an abstraction over whatever your chosen hypervisor driver may be. And then you have a whole load of other stuff underneath there that, from the Nova perspective, we tend to try and ignore wherever possible.

Nova comes with a couple of different compute drivers. The libvirt driver is by far the most widely used of those drivers, but the others are all fully supported in tree. The libvirt driver wasn't the original driver that Nova supported; that was the XenAPI driver. That driver was deprecated in the Train release, as far as I recall, and was finally removed over the course of the last cycle, the Victoria release cycle. libvirt itself, of course, isn't actually a hypervisor. It's an abstraction layer over different hypervisors, and Nova offers the ability to configure the actual hypervisor that libvirt is using. Again, there's a clear winner here, which is the QEMU backend and its KVM-accelerated variant.
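For reference, that choice comes down to a couple of config options; a minimal nova.conf sketch, showing the common values rather than anything exotic:

```
# nova.conf on the compute node
[DEFAULT]
# select the libvirt virt driver (the default, and by far the most common choice)
compute_driver = libvirt.LibvirtDriver

[libvirt]
# which hypervisor libvirt should drive: kvm, qemu, lxc, parallels, ...
virt_type = kvm
```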
QEMU and KVM are what's used in the gate, and they're what's used in the majority of deployments of the libvirt driver. I said earlier that from the Nova perspective we tend to try and ignore everything below the virt driver wherever possible; there's enough complexity to be dealing with at the higher levels. But for anyone that's interested, here's a very high-level overview of that black box, if you will. You have your libvirtd daemon, which is responsible for starting up and managing a couple of QEMU processes, again assuming you're using the QEMU backend, and all of that sits and runs on top of a Linux kernel, a Linux host.

What libvirt expects is a big blob of XML. If you were to think about Nova in very simplistic terms, what the Nova compute service is really doing, most of the time anyway, is converting a request from a user into just such a blob that it can feed into libvirt. It does a lot more than that, obviously, but in simplistic terms. And libvirt then is, again in very simplistic terms, converting that XML into calls to the QEMU executable, and a couple of other things as well. A massive, massive simplification, but good enough for now.

So what's so special about the libvirt driver? What does it do that the other drivers don't, particularly from the perspective of scheduling? To be honest, it pretty much comes down to complexity. The libvirt driver, as I said at the top, is quite complex. It is the most feature-complete of all the drivers supported in tree, and it can be a bit of a Rube Goldberg machine at times, with a whole load of stuff that's been tacked on over the years, maybe not fully thought out, that kind of thing. Stuff that, in fairness, we are continually trying to clean up, improve, and refine. So complexity is, unfortunately, an aspect of dealing with the libvirt driver, and I guess of OpenStack as a whole. My focus here is not so much on giving out about the libvirt driver, because, again, I help maintain this; I don't want to give out about the thing. Rather, it's on the kinds of conditions that might cause this very famous NoValidHost error, which anyone that has ever operated an OpenStack cloud will have seen all too many times.

That NoValidHost error is raised by the scheduler, so it probably makes sense to start from there and analyze things from that perspective. The Nova scheduler is quite a simple service compared to the others. It is responsible for taking a request from a user, passed through various other proxy services, and going and finding a host that it can actually schedule instances on. The way it actually does this is a kind of three-step dance. The first step, once it has received a request from the user via the other services, is to go and talk to placement; any recent OpenStack release does exactly this. Placement, for anyone that hasn't encountered it before, is the inventory management and inventory tracking system within OpenStack. It's responsible for tracking, for example, how many vCPUs, how much disk space, and how much RAM are available on each given compute node, each given server within your data center. Placement will get a request from the Nova scheduler and return a list of allocation candidates: places where it thinks Nova can schedule instances to.
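You can ask placement roughly the same question yourself using the osc-placement plugin for the OpenStack client. A made-up example, just to show the shape of the request; the resource amounts are illustrative:

```
# "where could a 4 vCPU / 4 GiB RAM / 20 GiB disk instance land?"
openstack allocation candidate list \
    --resource VCPU=4 \
    --resource MEMORY_MB=4096 \
    --resource DISK_GB=20
```

Each row that comes back is an allocation candidate: a resource provider, typically a compute node, that placement believes can satisfy the request.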
The scheduler will then filter these candidates based on criteria that placement perhaps doesn't understand, and after it has filtered out the duds, if you will, it'll weigh the remainder, choose a clear favorite, and then pick a couple of backup alternates it can use in case scheduling to the clear favorite doesn't work out. Those are the three components through which we're going to look at the libvirt driver and the implications of using it from a scheduling perspective, because it's a good way to structure things.

Starting with the last of those components, the weighers. There isn't actually a whole lot to say about these. The libvirt driver doesn't really have any special bearing when it comes to weighers, because weighers aren't that complex themselves. Weighers don't exclude hosts; they simply express a preference for any given host. As a result, a bad misconfiguration of weighers will typically result in poor scheduling decisions, but it won't result in a NoValidHost error. Nonetheless, there are three weighers that do have some implications: the CPU, RAM, and PCI weighers. The CPU weigher, as the name suggests, helps you either stack or spread instances across hosts depending on the availability of CPUs. The RAM weigher does the same for RAM. And finally there's the PCI weigher, which, as you would suspect, does the same for PCI devices. I'm not going to go into too many details yet, because we're going to cover each of these three topics individually in a moment, but these can sometimes behave oddly depending on how your libvirt driver is configured, and we'll see that in a moment.

Placement is where things actually get interesting. Placement thinks about the world in terms of two things: resources and traits. Resources are something that a compute node has multiples of, and traits are something that a compute node is able to do. This is, again, from the perspective of Nova; placement can be used to track things other than compute nodes and compute node inventory, but to simplify, I'm focusing on resource providers that are compute nodes. There are three things I want to look at here, actually four. VCPU and PCPU, VCPU being floating, unpinned instance cores. MEMORY_MB, which is how we track memory consumption and memory availability. And a final one that isn't listed here, DISK_GB, which is how we track availability of disk. Then, from the traits perspective, again, these are things that the compute node is able to do, and there's a whole load of them. Of the three I have listed here, the first indicates that the compute node is able to support emulated TPM devices, the second that the compute node is able to support AMD SEV memory encryption, and the final one that the hypervisor is able to support instances that request the virtio disk bus, via an image metadata property, for example.

So, problem number one when it comes to the libvirt driver: libvirt has two different ways of thinking about CPUs. For quite a few cycles now, we've supported this idea of CPU policies, which allow you to say: instead of my instance floating across all cores on any given host, I want to dedicate specific cores from that host to the instance, and no other instance should be allowed to use those cores. That's configured via a couple of config options, which replace the older, deprecated config options listed here on the screen.
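Roughly, that configuration looks like the following; the core ranges here are purely illustrative:

```
# nova.conf on the compute node (Train or later)
[compute]
cpu_shared_set = 0-3      # host cores for unpinned, floating instance CPUs (VCPU)
cpu_dedicated_set = 4-15  # host cores reserved for pinned instance CPUs (PCPU)

# the older, deprecated option these replace:
# [DEFAULT]
# vcpu_pin_set = 4-15
```

The policy itself is requested per flavor:

```
openstack flavor set my-pinned-flavor --property hw:cpu_policy=dedicated
```

And the domain XML that the libvirt driver generates for such an instance pins each guest CPU to a specific host core, along these lines:

```
<cputune>
  <vcpupin vcpu="0" cpuset="4"/>
  <vcpupin vcpu="1" cpuset="5"/>
</cputune>
```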
And from the actual libvirt driver's perspective, again, libvirt just cares about XML blobs, so this is implemented with a cputune block like the one just shown. Not really all that relevant to the discussion here, but nice to know.

There's a whole load of potential gotchas with using this feature, things that will impact your ability to schedule instances. The first of these is that in older versions of Nova, there was nothing to restrict shared CPUs from treading all over the dedicated CPUs. We pinned dedicated CPUs using that XML blob, but we didn't do any such thing for shared CPUs, which makes sense, because you'd have to continuously re-pin your shared instances as you booted a pinned, dedicated instance. But it did mean that you had to use something like host aggregates to separate hosts for pinned and unpinned instances.

Another thing we saw relates to this idea of CPU overcommit ratios. When you're using pinned CPUs, those overcommit ratios don't apply. So you'll have done your configuration, you'll have expected your CPUs to overcommit by 10, for example: you have 10 host CPUs, therefore you should be able to boot enough instances to consume 100 instance CPUs. That doesn't happen with pinned instances because, again, no other instance can share those cores, except for the misbehaving shared instances just mentioned.

And then finally, there was a whole load of woe and misery around live migration, caused entirely by our inability at the time to recalculate this XML blob. You would live migrate your instance to another compute node and it would keep using the pinning from the previous compute node. Which was fine if you saw a nice clean failure where those CPUs just weren't available: you'd pinned cores 19 and 20 on one host and the other host only had 16 cores, so you'd see a nice clean failure. But if you didn't see that, there was a good chance you'd end up pinned over other instances' cores.

The recommendations to avoid all of these scheduling mishaps: please, please use the newest versions of Nova, Train or later, because these have improved how we do pinned CPUs significantly. In Train, we introduced this idea of PCPUs, which are something you can't overcommit. They're a completely different inventory type, and they give you the ability to boot pinned and unpinned instances on the same compute node by maintaining two distinct pools of CPUs. On top of that, this is one of those advanced features that you should really only use if you actually need it. Typically, if you were to enable a guest NUMA topology instead, something we'll touch on later, that will give you almost as much of a performance improvement as pinning itself; or simply lower your overcommit ratios. Doing either of those things, using the newest version of Nova or using an alternative, should help minimize the number of issues you'll see, where you either won't be able to schedule to a host because it fails late in the process, or you'll be able to schedule but it isn't actually doing what you think it's doing under the hood.
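If you're on Train or later with this configured, you can sanity-check those two pools from placement's side; the UUID here is a placeholder for your compute node's resource provider:

```
openstack resource provider inventory list <compute-node-uuid>
# with cpu_shared_set and cpu_dedicated_set both configured, you should see
# separate VCPU and PCPU inventories, with an allocation ratio above 1.0
# only ever applying to the VCPU inventory
```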
And just like CPUs, memory has the exact same problem: not all memory is the same from the libvirt driver's perspective. Again, this is another one of those advanced features, and if you're not dealing with high-performance workloads, you might never have even encountered it. libvirt understands huge pages and the implications of huge pages for an instance. For anyone that hasn't encountered huge pages before, they are simply larger memory pages. There are massive performance benefits to be had from using huge pages wisely, particularly for things that consume large amounts of memory, but as we're going to see in a moment, there are some downsides to using them. The way you'd actually use them is by enabling this extra spec.

On top of that, you also have different backing stores for memory. In recent versions of Nova, you can use something like an SSD to actually back your memory: instead of using system RAM, you can use a file on a disk, so file-backed memory, as this example shows. And that obviously has impacts on tracking and inventory management for this stuff. And again, bringing it all back, this is all stuff that can cause NoValidHost error conditions.

The potential gotchas, at a high level, around memory consumption and the somewhat weird ways libvirt can deal with it: typically, if you're using anything other than standard pages, you're giving up the ability to swap and you're giving up the ability to overcommit. Even if you have overcommit configured on the host, it will just be silently ignored, and you'll see that, unfortunately, not from placement, because placement will happily return its allocation candidates, but rather later on in the scheduling process, as we're going to see in a moment. Beyond that, the accounting of huge pages versus standard pages tends to be quite poor. Even now, the libvirt driver doesn't actually support tracking huge pages in placement, which means that placement still only understands MEMORY_MB as a unit. The end goal here is to implement something akin to what we've done with VCPU and PCPU, where they are separate resources; it just hasn't happened yet. And then, as with dedicated CPUs, there is a whole load of woe, misery, and pain to be had with live migration if it isn't thought out properly ahead of time. Even down to, not quite data loss, but the out-of-memory killer running on the host and deleting instances on you. It can be quite bad.

The recommendations we have for this: firstly, use Nova 19 (Stein) or newer. There's a whole load of fixes around huge pages and the like in there; they were backported to earlier releases, but this is the first version with all of them in from the get-go. And something we no longer have to recommend for CPUs, thankfully, but still do if you're using different types of memory backing: host aggregates. If you have hosts using something like file-backed memory, or hosts using huge pages, use host aggregates to separate those different types of hosts. Not doing this is just setting yourself up for misery from the scheduling perspective, and also for things like the out-of-memory killer running in the background.
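A sketch of what both of those look like in practice; the flavor name and sizes are made up for illustration:

```
# flavor requesting huge pages to back the guest's memory
openstack flavor set my-hugepage-flavor --property hw:mem_page_size=1GB
```

```
# nova.conf on the compute node: expose 1 TiB of file-backed memory
# (value in MiB) in place of system RAM; note this is incompatible
# with memory overcommit
[libvirt]
file_backed_memory = 1048576
```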
Then, moving past CPU, RAM, and disk, and going down to the more low-level stuff: there are differences between the different hypervisors that libvirt supports, differences in the advanced functionality you can enable, differences if that functionality is enabled on one host and not on another, and then differences in the actual host platform itself: one platform has an Intel processor, another doesn't, it has an AMD chip. Things like this: different host CPUs have different features, and enabling advanced functionality selectively on one host and not another can break stuff, that kind of thing. And again, live migration is one of the areas where you're typically going to see a lot of the misery. Again, NoValidHost errors: you might see one if, for example, you request a particular image type that isn't supported by your hypervisor, because we report that stuff in placement now. You might also see failures coming from the scheduler filters later; we're going to touch on scheduler filters in a moment. There's just a whole load of different ways of enabling these things, and doing different things on different hosts can cause misery. So don't do it, basically; that's our lesson here. Prefer homogeneity where possible. If you're going to mix and match hosts, either make sure that the instances you boot will correctly schedule to the right hosts, or, ideally, use host aggregates again, like I said, to separate the different kinds of workloads onto different hosts. And as always with all this stuff, test it, because there's only so much we can actually test in the gate, and unfortunately a lot of these advanced features tend to slip under the radar. We have tried to close those gaps in recent years, but there is only so much we can test without spending days and days on each individual patch. So if you're interested in a particular feature, test it in a lab, make sure it works, and then slowly start scaling it up.

So that's placement. The last of the three components to consider are the filters. Again, there's a load of filters, just like the weighers, but only a couple of them actually behave differently based on the virt driver. I'm going to talk about two of those: the two I know best, and the two that are probably going to cause the most heartache from a deployment perspective, and also the most NoValidHost errors.

The first of those is the NUMATopologyFilter. All this filter and the PCI passthrough filter are doing is taking two objects, which aren't really relevant at the moment, running some conditional checks on them, and returning a boolean value of true or false depending on whether the host is acceptable or not. The NUMATopologyFilter is the thing responsible for enforcing NUMA affinity. For anyone that hasn't dealt with NUMA before, it's the thing that allows you to have, basically, multiple memory controllers on a compute node. The reason NUMA is important is that where you have multiple memory controllers on a compute node, and an instance running on a CPU attached to one memory controller tries to access memory on another memory controller, that access can incur latency and performance impacts. How that's enforced with libvirt is still not that relevant at the moment, but NUMA itself has implications for a massive number of the more advanced features within Nova and the libvirt driver, such as those listed here. So it is unfortunately quite a big topic and a big thing you need to think about when you're deploying OpenStack.
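To make that concrete, here's roughly how a guest NUMA topology is requested and enforced; the flavor name and the exact filter list are illustrative:

```
# flavor requesting that the guest be split across two NUMA nodes
openstack flavor set my-numa-flavor --property hw:numa_nodes=2
```

```
# nova.conf on the scheduler: the NUMATopologyFilter must be in the
# enabled filter list for any of this to be enforced at scheduling time
[filter_scheduler]
enabled_filters = AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter
```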
So, issues with the NUMATopologyFilter. Firstly, it's racy. Filters in general tend to be racy, and doing things like spamming multiple boot requests means the NUMATopologyFilter might allow multiple requests to use the same host, and you'll get late failures. There's also a whole load of little bugs that have been lurking in the background for a few years. We've worked on closing a lot of these off in recent releases, like the inability to live migrate such instances, but some still exist. Our recommendations for minimizing the damage: try to limit the number of parallel spawn requests, or increase the number of retries an instance is able to make, if possible. And again, if you've got different compute nodes with different capabilities, use host aggregates to separate them.

The other filter I want to talk about is the PCI passthrough filter. For this, we can only really say that we're sorry. It is a miserable, miserable thing to configure, and we fully realize that. It's hard to debug, and move operations don't tend to work all that well. It turns out that detaching and re-attaching a PCI device is about the only way you can actually move these things around, and that is indeed how recent versions of Nova implement live migration for SR-IOV-based instances. But again, like the NUMATopologyFilter, it's also racy. It'll frequently give you a compute node that doesn't actually have any PCI devices left if you try to boot multiple instances at the same time, that kind of thing.

Recommendations-wise, we've done a lot of bug-fixing work here in recent releases. Train is a good starting point; that's also the release that introduced the ability to, for example, live migrate instances with SR-IOV-based PCI NICs. We also recommend that if you're using PCI devices, you start with a very small deployment to get things working, and build up from there. Nova supports having multiple devices with different configurations and so on, but given how complex this thing is and how hard it is to debug, starting small will make your life a lot easier. And then, finally, be prepared to go and, unfortunately, hack on the code a small bit and add additional logging. This is an area where logging and debugging have historically been quite difficult, and if you happen to hit upon something, please let us know, because it's an area we'd like to improve. Going forward, we're hoping to move some of this stuff into placement, which should decrease the number of errors you'll see when using the PCI passthrough filter, but that's a long road, and it'll take a couple of cycles to get there.
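For context, roughly what the moving parts of that configuration look like; the device IDs and alias name here are illustrative, not anything from the talk:

```
# nova.conf on the compute node: which host devices may be passed through
[pci]
passthrough_whitelist = { "vendor_id": "8086", "product_id": "10fb" }

# nova.conf on the controller: a human-friendly alias for that device class
[pci]
alias = { "vendor_id": "8086", "product_id": "10fb", "device_type": "type-PF", "name": "my-nic" }
```

```
# request one such device via the flavor; the PciPassthroughFilter
# must be in the scheduler's enabled filter list for this to work
openstack flavor set my-pci-flavor --property pci_passthrough:alias=my-nic:1
```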
So, a quick wrap-up of the things I've been talking about quite rapidly. The libvirt driver is complex, unfortunately, or fortunately, depending on your use case, and because of that there's a great load of edge cases, corner cases, and outright bugs that you're likely to encounter when using it. To minimize those, a homogeneous deployment will simply make your life a lot easier: enabling features across all compute nodes, even using similar types of instances, will just make life easier. If you can't do that, we highly recommend using host aggregates to break your data center, your deployment, into more manageable chunks. Move operations are the things that are most likely to break: you'll attempt to move something and it won't be able to find a compute node, or, if it does, the instance won't actually boot successfully there. So those are the things we recommend you test, and in general we recommend starting small and testing these things as thoroughly as possible.

And that is the magic of the libvirt driver: a veritable wish list of bugs that we would like to fix at some point in the future. Hopefully this has been helpful for understanding some of these more advanced features in Nova and why they sometimes behave oddly. And we'd love it if you actually got in touch with the Nova developers if you encounter these issues and have stumbled upon strategies, or even fixes, to help mitigate their impact. Other than that, thank you very much for your time, and I'll get back to looking at the rain out the window.