Okay, it looks like we're ready to go here. And thanks, everybody, for showing up to the very last session of the conference. Hard to believe it's been a jam-packed four days. So I really appreciate everybody putting some energy into this last session and joining us here. If you're at the back and want to move forward, that would be great; it'll be a little more interactive for us. My name is Ian Jolliff and I work at Wind River. I'm the product architect for our Titanium Cloud family of products.

And I'm Chris Friesen, a senior member of the technical staff at Wind River. I contribute upstream, mostly in the area of Nova.

Yeah, so Chris does a lot of work upstream in Nova, and we've been having some really good conversations while we've been here with the Nova and other teams. Some people might not know much about Wind River: traditionally we were an embedded software company, but we've also developed products for networking and other verticals, automotive and things like that. One little-known fact: we actually have software running not just on Earth, but on Mars. Some of our software ran in the Mars rover, and I think we were also mentioned in that recent movie about Mars with the guy who played Jason Bourne. The Martian, that's it.

Some of our areas of interest, certainly on the networking side, are how to get really predictable performance with Nova. We've been working on various telecom and IoT workloads, all the way from 3G and 4G, and starting to get into some 5G applications. We're certainly seeing lots of interest in how to get predictable performance for applications like Cloud RAN and content delivery networks, as well as virtual CPE applications. So we've been involved in NFV solutions from day one, and my personal background is in telecom, for many years. I'm very interested in what's going on in the mobile edge computing space right now, or multi-access edge computing. I think that's going to be very interesting, but there are also a lot of applications today that are very transformative and gain some neat capabilities when you start running them in the cloud. So our focus is really developing cloud solutions for critical infrastructure applications. And as Chris was saying, we're significant contributors to OpenStack and really huge fans of OpenStack.

What we're going to talk about today, though, is how you get predictable performance, and we'll drill down a little into what's going on under the covers. We're going to do a walkthrough of launching a VM, show you all the steps that happen, and look at some of the many considerations you want to make to get the best performance possible out of your application. Then we're going to talk a bit about how to configure things, and show some of the commands that actually get you that predictable performance out of OpenStack. So we'll highlight some of the factors you want to consider and talk a little bit about CPU, memory, and NUMA awareness and how you can leverage those capabilities. So with that, I'm going to turn it over to Chris to walk you through the instance boot.

All right, thanks, Ian. So the booting of an instance in OpenStack is a multi-stage process, and there are a lot of components involved. We'll walk through each of those components in a certain level of detail and then dive down more into the Nova side of things, because it has the most complexity here.
This is a high-level picture of what the chain looks like. We start by putting the image in Glance, we create the networking ports, we create the Cinder volumes, the scheduler selects the host, and then the host launches the instance. We're going to make a few assumptions just to simplify things, because otherwise this talk could be really, really long. There was actually a really in-depth talk earlier this week from some of the Nova folks where they walk through the actual logs of an instance boot inside Nova in serious detail, so if you're interested, I suggest you check that out. In this case, we're going to assume that OpenStack is up and running, that we have configured infrastructure and provider networks, that the cloud admin has defined some flavors, and that the user has created a keypair so they can actually log into the instance once it's up and running. We're also going to assume that the compute node is running libvirt with QEMU for the hypervisor. Most of this will still be true for other hypervisors, but some of the final stages are a little bit different. So basically, we're ready to go.

The first stage is to create the Glance image. The top line here is the command-line interface for creating the image, and there are a whole bunch more options you can add if you want to specialize it, but this is the bare minimum needed to make an image in Glance. I'm going to start with a file that I have locally and tell Glance that I want to make an image from this particular file. I'm going to specify the disk format, which might be raw, the default, or more commonly something like QCOW2. And then I'm going to give it a name; that name is purely to make it more human-readable. It's optional, but it makes it easier to keep track of your images if you give them a name you can recognize. When you do this, Glance API will process the incoming HTTP request and do some validation of the request. It will load the specified file; there are other ways to specify the image, so you can tell it a location to fetch from, for example, or make your image from a Cinder volume. In this case, I'm just going to do it from a file. Glance API will then store the file to the backend that Glance has been configured with. The Glance registry will store the metadata about the image. And then Glance API will return a Glance image UUID back to the caller.

The second stage is to create a Cinder volume. In this case, we're going to create the volume from the image we just created in the first step. So this is again the OpenStack CLI command to make the volume. I'm specifying the image, the type, the size of the volume that I want, and, again, a human-readable name for the volume. Cinder API processes the incoming HTTP request, does some validation, and forwards the request on to the Cinder scheduler via an RPC call. The Cinder scheduler will select the volume store where the data will actually go, based on the volume type and some other factors that aren't important right now, and then forwards the request on to the Cinder volume service for that store via another RPC request.
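As a concrete sketch of the two commands Chris is describing, using the unified OpenStack CLI; the file name, volume type, size, and display names here are illustrative, not from the talk:

```
# Create a Glance image from a local QCOW2 file
openstack image create --file ubuntu-16.04.qcow2 --disk-format qcow2 my-ubuntu-image

# Create a 20 GB bootable Cinder volume from that image
# (the "lvm" volume type is just an example; use whatever types your cloud defines)
openstack volume create --image my-ubuntu-image --type lvm --size 20 my-boot-volume
```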
Cinder volume will communicate with the backend to allocate the space. It will then download the image from Glance via an HTTP API call, which involves Glance API and the other Glance components, and copy the image data into the backend storage. Cinder API will then return a Cinder volume UUID back to the caller.

The next step is to set up our networking. This is the command to create a network port. We're going to specify which network we want to create the port in, specify the vNIC type, and give it, again, a human-readable name. The vNIC type would be normal for virtio, which is the common case, or direct for SR-IOV VF pass-through. There are other options if you're doing a full physical function pass-through, and some other more complicated options as well. Neutron server will process the incoming HTTP request, allocate the port, and associate the port with the specified network. Then the Neutron API will return the port UUID back to the caller.

Once all of those stages have been completed, we're ready to actually tell Nova that we want to make the instance. This is the command to create the instance in Nova, and again, this is the bare minimum set of options you would want to specify. The list of options for this command is really long, so there are a whole bunch of ways of customizing this and being more specific about exactly what you want. Here we're going to specify the volume, the flavor, the key, the network port ID, and, again, a human-readable name. Nova API will process the incoming request and do some initial validation. Are you starting to see a pattern here? Then it forwards the request to Nova conductor via an RPC call. Nova conductor will do some housekeeping and forward the request on to the Nova scheduler via an RPC call.

The Nova scheduler will then pick a compute node based on potentially many factors. It's going to look at the available resources on the hosts: things like CPU, RAM, local storage, PCI devices, all of the various resources. It's going to look at the image properties and the flavor extra specs; there are quite a few things that overlap between the two and can be specified in both. Here we're talking about things like host aggregate metadata keys and virtual topology information: how many CPUs do you want in your guest? How many NUMA nodes? Do you want huge pages, and if so, what size? Do you want a specific hypervisor type? That kind of thing. It can also look at the host aggregates and the availability zones, which are ways of grouping compute nodes together. And finally, it can look at scheduler hints. This would include things like server groups, which might have an affinity policy; that specifies either affinity or anti-affinity and controls how this instance is grouped relative to the other instances in that server group. The scheduler will then weed out hosts that don't match the filters, throwing out all the ones it knows would not be suitable. And then, for the ones that remain, it will analyze them all via something called the weighers. These are used to determine, of the suitable hosts, which one would be the best to put the instance on, and different deployments may choose to customize the weighers.
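Again as a sketch, the port-creation and boot commands being described might look like this with the unified CLI; the network, flavor, and keypair names are illustrative, and the port UUID is a placeholder:

```
# Create a Neutron port: vnic-type "normal" for virtio, or "direct" for an SR-IOV VF
openstack port create --network my-net --vnic-type normal my-port

# Boot the instance from the Cinder volume, attaching the port we just created
openstack server create --volume my-boot-volume --flavor m1.small \
  --key-name my-keypair --nic port-id=<PORT_UUID> my-instance
```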
There are some Nova config options you can set to adjust how the weigher calculations are performed. Finally, the result of the scheduler is returned back to Nova conductor. So at this point, we have selected a host for the instance to run on. The conductor will then go and look up which compute node is in charge of that particular host. In the simple case of libvirt, the compute node and the host are actually the same thing; in something more complicated like VMware, you may have one compute node that is in charge of many hosts. So in the general case, we find the host and then we find the compute node. We then pass the creation request on to the compute node via an RPC cast. Up until this point, we've been using RPC calls, which have to return; a cast is an asynchronous operation where we fire it off and then basically forget about it. So we then return from here back to the HTTP API caller, and the rest proceeds asynchronously.

The next step, on the compute node itself, is to allocate the resources for the instance. These could include itemized resources, which are things like dedicated CPUs or specific PCI devices, as well as countable resources: shared CPUs, RAM, disk, that kind of thing. We then start the asynchronous network allocation by calling out to Neutron, which triggers some work in the background on the Neutron side while the rest of this proceeds. We'll set up the connection to the Cinder volume storage backend and call out to Cinder via its HTTP API to attach the volume to the instance. So at this point, the resources have been allocated for the instance.

This next step is libvirt-specific, and it'll be slightly different if you're using other hypervisors. We'll create a directory for the instance and generate a libvirt XML file based on the instance configuration. Any local disks that we need, such as ephemeral disks or swap disks, will be created at this point. We'll poll Neutron via its HTTP API to make sure the ports are ready. We'll configure the local virtual interface for the instance, so this would be the vSwitch port, vhost, whatever is needed at this point to make sure the networking is ready to go. We'll set up any firewall rules that are needed, and then finally Nova makes a call out to libvirtd to actually start the virtual machine. And at this point, your instance is running.

Very cool, that didn't take too long at all. All right, so let's talk a little bit about how to take advantage of what we've just learned from Chris about what's happening under the covers. We're going to talk about how CPU contention can impact performance, how badly-behaved applications or VMs on the same host can impact the overall solution performance, and about cache hits and misses. Then we're going to move into memory and some of the options that can help you get better memory performance out of your application. We'll talk a little bit about networking; obviously, high throughput and low latency are absolutely critical for very high-performance applications. And we're not going to talk about storage, because it really deserves its own talk, and we'll probably propose one for the next summit. I think, Chris, that would be a good one. So you're going to talk us through the CPU stuff.
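As a footnote to the libvirt-specific steps above, one way to see the end result on the compute node is to query libvirt directly; a minimal sketch, assuming a libvirt/QEMU deployment (the domain name is illustrative; Nova names its domains like instance-00000001):

```
# List the running libvirt domains on the compute node
virsh list

# Dump the libvirt XML that Nova generated for the instance
virsh dumpxml instance-00000001
```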
Right. So the basic idea is that a default install of OpenStack will default to 16-times CPU overcommit, which means you're allowed to have 16 virtual CPUs running on a single host CPU. As you can imagine, this can cause CPU contention between the guests: if they all try to do something at the same time, they're not going to get the full amount of speed they would otherwise. The way to reduce this is to reduce or disable CPU overcommit. You may also want to set the CPU thread policy, which decides whether or not anyone else is allowed to run on the hyperthread siblings of the same physical core, and you would want to use dedicated CPUs.

There's another issue with CPU cache contention. This is where you have multiple instances, or even multiple virtual CPUs within one instance, contending with each other for access to the cache on the host CPU. This can be dealt with, again, using the CPU thread policy, and there's also some work being done around Cache Allocation Technology. Yeah, that's part of a broader set of functions called Intel RDT, and I think that's something we're pretty excited about.

Finally, there's something else to worry about, which is CPU contention between the host and the guest. Things to be concerned about here are system management interrupts, which can cause significant latency spikes when the host, specifically the host BIOS, decides that it needs to do some housekeeping work and just interrupts everything else to run whatever it thinks it wants to run. To get around that, you might need to tweak the BIOS, or in the worst case, you might actually have to pick different hardware. The other issue is the stealing of CPU cycles by the host in order to run whatever other work it has, whether that's kernel threads, other host processes, or even other virtual machines. One of the ways around this is to dedicate specific CPUs to the host and reserve the rest of the CPUs for the instances to run on. There are Nova config options and kernel boot arguments that can be used to isolate the host and management work onto a subset of the CPUs and minimize, as much as possible, the amount of housekeeping work being done on the CPUs that run the guests. That way you reduce interruptions by the host as much as possible, leaving the guest free to use its full CPUs for itself.

Here's a little more detail on the hyperthreading policies. This is specified in an image property or in a Nova flavor extra spec, and there are three possible values. Prefer, which is the default, will try to give you the behavior of require, but if it can't, it will let the request go through anyway. Require means all of the virtual CPUs from your instance will be placed on thread siblings of the same host cores. And isolate enforces the use of only one hyperthread sibling on each host core; the other sibling, if there is one, will be left unused by any other VM. Most of the time, if you care about maximizing your performance, isolate is the way to go. It's possible to have applications that do best with the require option, but you probably want to test and see what the results are for your specific application. We did a talk yesterday with some benchmarking results that show the impact of some of these settings.
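A sketch of how these knobs are typically set; the flavor name and CPU ranges are illustrative, and the nova.conf option names are the ones in use in this era of Nova:

```
# nova.conf on the compute node: disable CPU overcommit,
# and make only CPUs 4-15 available to instances (0-3 stay with the host)
[DEFAULT]
cpu_allocation_ratio = 1.0
vcpu_pin_set = 4-15
```

```
# matching kernel boot argument to keep host processes off the instance CPUs:
#   isolcpus=4-15

# flavor extra specs requesting dedicated CPUs and the isolate thread policy
openstack flavor set my-perf-flavor \
  --property hw:cpu_policy=dedicated \
  --property hw:cpu_thread_policy=isolate
```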
The key takeaway is that you only have one back-end pipeline per host physical core, and if both hyperthreads are trying to make use of the same resources in that pipeline, it's going to impact your overall throughput.

The second, lower-level set of details for configuring these things is the CPU policy, which can be set to either dedicated or shared. By default, it's set to shared. If you care about performance, you probably want to set it to dedicated, which gives you a full host CPU for each guest vCPU. If you want to get down into even more gory detail, you can specify how many NUMA nodes you want in your guest. The reason you might want to do this generally involves memory bandwidth, because it gives you access to memory from all of the physical sockets. You may also want to do it just for your own reasons, to break your CPUs up into different NUMA nodes. The caveat is that your host has to have at least as many NUMA nodes as the guest. Once you've specified how many NUMA nodes you have, you can specify which vCPUs you want to put into each guest NUMA node. And if you do specify things down at that level of granularity, you will also have to specify how much RAM you want in each of your NUMA nodes.

Finally, it's possible to specify that you want this to be a real-time virtual machine. That has some impact depending on the hypervisor. If you do this with libvirt, it will pre-allocate and lock your memory so that it cannot be swapped out. It will also run a subset of your virtual CPUs as SCHED_FIFO, that's the scheduler policy it will use on the host, with a real-time priority of one. So they run as real-time processes on the host, and you can specify which vCPUs you want to be real-time by using the CPU real-time mask. The thing to be aware of is that there's an implicit assumption that all of the vCPUs will be real-time, so you actually have to write the mask to exclude the CPUs you want to reserve as management CPUs. If you care about this, you need to go and look at the spec, or possibly even look at the code, because it's a little unclear right now from the documentation. I'm trying to clear this up to make it more intuitive to use, so we'll see if that gets in.

All right. Thanks, Chris. So I'm just going to cover the last few topics here. As we said, we did a talk yesterday about some of the performance impacts, so I won't cover that, but just like CPU contention, the same bad things can happen with memory contention. One of our big learnings is that enabling huge pages gives a very significant performance improvement. It gives you far fewer mappings, so you're much more likely to stay within the TLB. So you really want to make sure that huge pages are configured for memory, and note that doing so automatically disables memory overcommit as well. And again, since we're in a multi-NUMA environment, memory is allocated across the different NUMA nodes. So just like Chris showed on the CPU side, I'm going to show you how to configure that on the memory side as well. If you've configured the number of NUMA nodes for CPUs, you actually have to do that for memory as well; otherwise you'll get some very cryptic errors and not really know why your VM isn't booting.
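To make the NUMA and real-time options concrete, here is a sketch of the flavor extra specs involved; the flavor names, counts, and sizes are illustrative, and the per-node CPU lists and memory have to add up to the flavor's vCPU count and RAM:

```
# A 4-vCPU, 6 GB flavor split across two guest NUMA nodes
openstack flavor set my-numa-flavor \
  --property hw:numa_nodes=2 \
  --property hw:numa_cpus.0=0,1 --property hw:numa_mem.0=2048 \
  --property hw:numa_cpus.1=2,3 --property hw:numa_mem.1=4096

# A real-time flavor: all vCPUs real-time except vCPU 0, kept for housekeeping
openstack flavor set my-rt-flavor \
  --property hw:cpu_realtime=yes \
  --property hw:cpu_realtime_mask=^0
```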
Again, Chris is doing some good work upstream to try to help with that. You can also specify the RAM allocated to each NUMA node. And lastly, as I said, it's relatively easy to get huge pages configured: the page size can be small, large, any, or an explicit 2 MB or 1 GB. We found in our benchmarking that 2 MB worked best. So along with your CPU allocation model, there are some really nice options on the memory side as well.

Again, it's very important to consider the topology of your servers. If you're in a homogeneous environment, it's relatively easy, depending on how your cloud is built out. The mapping of PCI buses to NICs to NUMA nodes can all have an impact. Even if you're using PCI pass-through, if your NIC is on NUMA node zero but your VM is on NUMA node one, you're going to end up crossing the QPI bus, and that will have a performance impact. So some awareness of your hardware topology and how you want to map it to these different configuration parameters is really important.

Also on the host and guest side, choosing your virtual NIC type is very important. E1000 is probably your lowest-performance option, since it's a fully emulated NIC. Paravirtualized NICs like virtio are probably the best general option, and they give you a very nice performance boost. And then PCI pass-through and SR-IOV really take the physical NIC and map it into the guest. The downside there is that you're tying your VM to a physical device, which limits your ability to live migrate the VM from one node to another, and you also have to manage more of what's going on in your VM: you have to pay much more attention to network security, as well as to whether you have the right drivers in your guest. We've done a lot of work with DPDK-based vSwitching, and if you use a virtual switch based on DPDK, you can also run DPDK in the guest and get very close to line rate on a small number of vSwitch cores. Using a poll-mode driver in the guest gives you a huge performance advantage there, plus all the benefits of being able to live migrate that VM from one host to another. And I think earlier today there was a really good presentation on live migration that I hear had a jam-packed room, so I look forward to watching that video.

So in summary, Chris did a great job walking us through the booting of an instance; it really touches all the core OpenStack projects. I think it's important to understand how that boot process works, so you can think through the performance implications of the other parameters we showed you through the talk. I forget which one, but one of those commands has a huge number of boot arguments, and choosing wisely there is really important. But again, this is engineering for high performance, so the tunability, and doing some structured experimentation, is really important. That's how we learned our way through this, and I think overall, predictable performance for critical applications will really help the adoption of cloud solutions for applications targeting critical infrastructure. So that's the end of the talk, folks. We'd really like to thank you for coming to the very last session of the conference, and we're happy to take any questions you may have.
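Before the questions, one last configuration sketch to go with the huge-page discussion; the page counts are illustrative, and the host has to have the pages pre-allocated before Nova can use them:

```
# Host kernel boot arguments pre-allocating 2 MB huge pages at boot:
#   default_hugepagesz=2M hugepagesz=2M hugepages=8192

# Flavor extra spec asking Nova to back guest memory with 2 MB pages
# (values can also be "small", "large", "any", or 1048576 for 1 GB pages)
openstack flavor set my-hp-flavor --property hw:mem_page_size=2048
```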
Thanks for the presentation. I would like to know more about the binding, the affinity, between SR-IOV virtual functions and the guests. I think you had a slide where you were talking about it. Could you explain a little further?

So by default right now in the code base, if the PCI device reports its NUMA affinity properly, Nova will bind the instance to the same NUMA node as the PCI device. There are two issues with this. One is that there are PCI devices that don't report their NUMA node properly, and in that case you're not guaranteed to get the right affinity. The other issue is that if you have other instances being scheduled on that compute node, they may actually fill up that NUMA node before anything arrives that actually wants to use the PCI device, and then your PCI device is basically wasted. So there's a spec in progress right now that would prefer to put only instances that want to use the PCI devices onto the NUMA node that holds them; it would tend to push other instances onto the other NUMA node, basically reserving that NUMA node for instances that actually want the PCI devices. There's a final issue in that sometimes you want to make use of the PCI device even if it's on the other NUMA node, and right now there's not that level of flexibility. There's another spec that would allow that via the introduction of a sort of soft affinity, where cross-NUMA use would be permitted if you chose to allow it. And if you really cared about performance, you could set that to a stricter setting, and then it would fail rather than give you the cross-NUMA traffic.

Just one more. Is there a way to check how a running VM is operating, whether or not there is affinity between the device and the VM?

So right now there's no easy way to do that without logging on to the compute node and checking where it is. It's pretty much a manual action.

But what command shows, for example, which node the PCI device is on?

Yeah, so you can look at the topology in Linux and it will show you which NUMA node your PCI device is on. So you can look at the PCI bus and see which NUMA node it's on. And then, if you use the libvirt commands directly, you can do virsh vcpuinfo or virsh vcpupin and it will show you which CPUs the instance is pinned to. From that you can look at which NUMA node those CPUs are on and make sure they're all the same.

Next, a huge page optimization question. If all of my flavors are multiples of a gigabyte, would I want to use anything smaller than one-gigabyte huge pages?

If you know they're all gigabyte multiples, then you may as well make it gigabyte pages.

Do you have any experience running bare metal nodes using Ironic, and would you have any recommendations for how to optimize that?

Sorry, no.

Okay, so I have a question related to the SR-IOV comment. We try to evaluate the merits and demerits of using various technologies like SR-IOV or DPDK. DPDK is great: you can do live migration, you can push policies and things like that, but it ends up burning your CPU cores. Yes. You have to compromise, or pay the price, that way. The other option is SR-IOV. You don't give up your CPU cores to get the performance, but the flip side is there's no live migration, and of course you cannot apply any other policies and things like that through OVS.
There was some work in the community that I saw, but it looks like it has been abandoned, to let VMs with SR-IOV live migrate using some kind of macvlan option or something. So I just wanted to know if you're familiar with that, and whether there's any community effort to make it available in a future release, where VMs with SR-IOV ports can also be live migrated.

So I'll go first. I'm certainly aware of some of that work. In my opinion, there's still some important work to be done on live migration in general before we tackle the SR-IOV piece, and there's also some resource management work that Chris and I are talking about that probably needs to be taken care of first.

Yeah, so specifically around live migration: right now, if you do anything that gives you better performance, so dedicated CPUs, huge pages, NUMA nodes, any of this stuff, live migrations are not reliable in the current upstream code base. It may appear to work, but you may end up with overlapping CPUs on the destination; it might fall apart on the destination. The resource tracking is just not currently there, and so there are problems around that. It will probably get fixed as part of the placement and allocation work that is being done right now upstream, but that probably won't land until Queens or Rocky or something.

Just to add, that's why I intentionally didn't go into other live-migration issues, like what happens if you have huge pages. As soon as you do a PCI pass-through, that gives you a NUMA topology, and you run into all the same problems.

Yeah, and another related problem with live migration with some of these options is that if you have huge pages, especially one-gig pages, live migration may take forever to finish, depending on the activity in the VNF, for example, because pages keep getting dirtied, and now an entire one-gig unit has to be pushed over the wire to the host it's being migrated to. It can go on for several minutes, or tens of minutes. So that's another problem with this, but specifically I was more interested in the SR-IOV live migration part. So if you have any insight into that, it would definitely help.

Nothing at this point, no. John Garber gave a live migration talk earlier today; have a look at that. Basically it comes down to a compromise between how long you're okay with downtime and whether you actually want the migration to happen: you may have to sacrifice some downtime in order to make sure the migration completes successfully, and there are strategies and knobs that will allow you to do that. So that's John Garber, about four o'clock this afternoon, on live migration. Yeah, at a high level you can tweak the amount of acceptable downtime during the migration. You can also force the migration to complete, which will just pause the instance and push it across. You can enable auto-converge, where it will slow down the guest to make it more likely that the migration can complete. And finally, you can enable post-copy, which is useful but really dangerous.

We experimented with post-copy, and as you mentioned, it is still in very early stages, has a lot of side effects, and things become very unstable.

Any other questions? All right, well, thanks, everybody. Thanks, Chris.
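For reference, the manual affinity check described in the Q&A might look like this on the compute node; the PCI address and domain name are placeholders:

```
# NUMA node of the PCI device (-1 means the device doesn't report one)
cat /sys/bus/pci/devices/0000:03:00.0/numa_node

# Host CPUs the instance's vCPUs are pinned to
virsh vcpupin instance-00000001

# Which host CPUs belong to which NUMA node
lscpu | grep "NUMA node"
```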