that can help make high-performing workloads possible in the cloud. We'll talk about some of the challenges we faced and how we bravely overcame them, and about some future plans that we still intend to deliver on in the months and releases to come.

As I said, most people here are pretty familiar with OpenStack. Is there anybody here who has no idea whatsoever what OpenStack is? I didn't think so. Who would actually raise their hand? So I'll go through this quickly. I wrote this slide myself, so it wasn't copy-pasted from a website or something. OpenStack is basically cloud infrastructure, whatever that means to people. It's open source and mostly written in Python; that number on the slide is made up, but it really is mostly Python. It's organized as multiple projects, each of which manages one kind of resource through an API and potentially a dashboard, and together they present an infrastructure-as-a-service solution.

More about OpenStack Nova itself, which is the compute part, which is probably what the next bullet point says. Nova manages compute resources, basically VMs that consume CPU and memory on your compute infrastructure, through a REST API, so it lets you schedule VMs across your pool of hardware. Storage and networking are handled by different components and I won't be talking about them here. Nova lets you start, stop, resize and snapshot your VMs, and it also lets administrators do more advanced tasks like migrate and live-migrate them. What is maybe interesting, and maybe gets a little forgotten inside Red Hat, is that it supports several different virtualization technologies; we mostly talk about libvirt/KVM, which is what most of the features, actually all of the features I'm going to be talking about in this talk, rely on.

A little bit about what the whole elastic cloud idea is about, because it is somewhat at odds with the requirements that show up once developers start caring about high performance and about precise utilization of specific resources. The idea is to allow quick provisioning of commodity hardware. Instead of letting users consume arbitrary resources, OpenStack presents resources through an abstraction called flavors, which are predefined combinations of the resources you want to use, and it allows only very simple scheduling that focuses on scale. In its simplest deployment scenario it doesn't try to do any optimal or clever placement; it tries to be fast and simple. Like I said, other things are handled by different projects and we won't go into them, although they might be interesting in the big picture; this talk is mostly about what Nova lets you do. An important thing to emphasize, which I already said a couple of times, is that users have no visibility into the hardware their workloads are running on; the user is supposed to be completely oblivious to the actual hardware underneath.

This is a brief slide of the architecture. The important thing is that there's an API that accepts requests, and then there are message queues, so most of the work is asynchronous. The piece I'm going to talk about in a bit more detail is the Nova scheduler and how it presented some challenges for us.
And this is actually on the next two slides, I guess. It's not too important for this talk to go into these details, but the general idea is, like I said, there's an API, requests are asynchronously dispatched onto a queue, the scheduler decides where to place the VM, and then it gets done, or fails, as the case may be, but mostly gets done.

It's interesting to look at how this fits with high-performance requirements. There are flavors, as I said, and they carry basic information about the resources that will be assigned to an instance; I'll show a small example of what a flavor looks like in a moment. In the simplest case that's just the number of CPUs and the amount of RAM, and potentially disk, that the VM will have. It can be a bit more complicated than that, and some of it can be overridden through image metadata, which is user-controlled, while flavors, and this is an important thing I forgot to mention earlier, are admin-controlled. Admins define them, and that is the only granularity at which users are able to consume resources.

The scheduling does no optimal placement. It takes the list of all the hosts, puts them through a set of filters that the admin can tune to an extent, and any host that passes all the filters is considered. Filters are basically functions that return true or false for any given host in your pool of compute hosts. Most of the filters are written without considering specific devices or resources; they only consider how many of them there are, and the instance just grabs whatever is there. That poses a problem once you want to assign a specific resource, like a specific CPU or PCI device, to a VM. So what I'm trying to say here is that the basic framework Nova provides does not necessarily play well with the things we wanted to do in order to give users stronger guarantees about the performance of their VMs.

So why should people care about high performance in the cloud? If they want high performance, they should probably use bare metal; that's what some people think and say. But as we all know, everything is moving to the cloud, so we have to adapt. All joking aside, there are some interesting use cases. The famous one, the poster child for this kind of work, is NFV, where telco providers want to move a lot of their packet processing into VMs, and that can be latency- and performance-sensitive. That's where the idea and the push came from. There's also the fact that modern hardware is NUMA-capable, which is something you may want to start considering even for non-critical applications. So when we started thinking about this, we realized that we had to find a way to expose, or let users request, certain high-performance characteristics without actually allowing them to choose CPUs or specific devices, because that's not something we really want to do, or can easily do, with the existing APIs and abstractions that Nova and OpenStack provide. That was the design problem we tried to solve when we started doing this.
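To make the flavor idea concrete, a plain flavor is really just an admin-defined bundle of resources, something along these lines (the names and sizes here are only illustrative):

```
# The admin defines the menu of resource bundles
openstack flavor create --vcpus 4 --ram 8192 --disk 40 m1.example

# Users can then consume resources only at that granularity
openstack server create --flavor m1.example --image some-image my-vm
```

Everything in the rest of the talk is layered on top of this: the performance features are expressed as extra properties the admin sets on a flavor, not as direct requests for specific pieces of hardware.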
So the next couple of slides are about the specific performance-enhancing features we added, how we expose them through the API and how that works. I'll try to explain how that's different from just saying "I need three CPUs and X amount of RAM", and hopefully that will be clear, because that's kind of the crux of this talk.

The first thing is, obviously, we wanted to make Nova aware of the NUMA characteristics of the compute hosts, because this can help with memory access latency. I'm not going to go into a lot of detail here, but basically, if you're accessing memory that is not on the NUMA node your code is running on, there's a penalty: it takes longer to access that memory, which is obviously something we want to avoid as much as possible. We don't necessarily need to care about this for cheaper VMs, but it's important for high-performance workloads, and it's also interesting in combination with dedicated IO devices.

The first time I gave this talk was at KVM Forum, so there is a bit of libvirt/KVM material here. Libvirt exposes the host capabilities, including the NUMA topology of the host, so the first thing was to take that, store it in Nova, and start making placement decisions based on it. And this is how we decided to expose it to the user: we only allow users, actually admins, to say that a VM should have a certain number of NUMA nodes. Just saying one is also valid; it still means that Nova will make sure the VM is confined to a single NUMA node. There are some additional options that I'd say have limited usefulness, but they're there. The thing I'd like to emphasize is that there is no way for users to choose which NUMA nodes the VM goes to; the request just says "I want this to end up confined to a NUMA node".

So how is this done in Nova? As we've said, there's a way for a compute host to expose this information to the Nova scheduler. The requested topology is saved as well, and when the instance is being scheduled onto a host we basically check, based on the number of CPUs requested and the number of CPUs available on a NUMA node, whether it fits there, and if not, we move on. As you can see, this is different from placing VMs on hosts; we're really placing them on NUMA nodes. The fitting algorithm places the instance on any available NUMA node on the host, so there's no way to request a specific NUMA node, which becomes relevant once you also want IO devices, like PCI devices, but we'll get to that.

Finally, since the libvirt driver is the only Nova back end that implements this, libvirt does the work of actually confining the instance, and this is done by the vcpupin element when Nova defines the domain in libvirt. As you can see, we confine a set of guest CPUs to a set of host CPUs; we don't place them on single CPUs, we place them on the set of CPUs, so in this example 0 and 1 would be the first NUMA node of the host. This is important for memory as well, not only CPUs, so there's also an element that tells libvirt to allocate the memory from that specific host NUMA node.
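Roughly, the request and the resulting domain XML look something like this; the flavor property is the hw:numa_nodes extra spec, while the CPU numbers are just an illustration:

```
# Admin-set flavor property: confine the guest to a single NUMA node
openstack flavor set m1.example --property hw:numa_nodes=1
```

```
<cputune>
  <!-- each guest vCPU can float over the host CPUs of one NUMA node -->
  <vcpupin vcpu="0" cpuset="0-1"/>
  <vcpupin vcpu="1" cpuset="0-1"/>
</cputune>
<numatune>
  <!-- guest memory comes from the same host NUMA node -->
  <memory mode="strict" nodeset="0"/>
</numatune>
```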
The next thing we did was huge-pages awareness. This is also interesting because of performance. The win here is that if you don't have a lot of pages to keep track of, the virtual-to-physical address translations can all be cached in a constrained resource called the TLB, so you use the TLB better and that makes things perform better. These slides were done by a colleague of mine who actually did the work here, and the numbers showed as much as a 15% improvement; there's a reference on the slide. The example shows how much more efficient it is with a larger page size: you only need one entry in the cache, you don't have to walk the page tables at all, you just get a cache hit and you have your physical address, which is obviously much more efficient and important for workloads that care about this kind of thing.

Different architectures have different page sizes, but that's not the important part here; the important part is that we have to expose this through the API in a smart way, because you don't want to make assumptions about what kind of hardware you're running on. So the API bit of this was allowing the flavor to just say "use large pages", and what that means depends on the actual hardware. We also added an "any" value, which lets users request it themselves if the admin allows something like that; I think that is also slightly limited in its usefulness, but it's there.

What's interesting to mention here, and I probably forgot to mention it earlier, is that you have to set huge pages up on the host when you boot it, so it somewhat breaks the promise that you can quickly provision any type of hardware; there's now additional work that needs to be done by the sysadmins or operators of the system. But there has to be some kind of trade-off, we don't live in a perfect world, and this was deemed acceptable, though it's worth mentioning as one of the constraints of the design. Huge pages are implemented on top of the NUMA awareness, so they are tracked per NUMA node, and we even allow asymmetric allocation, which is apparently a feature people wanted. The libvirt version on the slide and above exposes the information about the page sizes available on the host, so, same as before, we can grab this from all the compute hosts and consider it when placing VMs. Once we place the VM, the Nova libvirt driver generates the XML that requests hugepage backing and passes it over to libvirt.
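As a rough illustration again (the exact values depend on the release and on what page sizes the host actually provides):

```
# "large" means whatever large page size the host offers (2 MB, 1 GB, ...);
# an explicit size in KiB, or "any", can also be used
openstack flavor set m1.example --property hw:mem_page_size=large
```

```
<memoryBacking>
  <hugepages>
    <!-- back the guest RAM with 2 MB pages taken from host NUMA node 0 -->
    <page size="2048" unit="KiB" nodeset="0"/>
  </hugepages>
</memoryBacking>
```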
The next feature is CPU pinning, which is actually maybe not the best name for it, because what it really does is give VMs dedicated CPUs. Normally you let your VM run on any available CPU, and in the KVM/libvirt case it's up to the host OS scheduler to decide; but for performance-sensitive applications you obviously want to let a VM use a CPU completely on its own. Hyper-threading comes into play here as well, and that presented interesting problems: sometimes you want to prefer being on the same physical core because of better cache utilization, sometimes you don't, and that is very tricky to model properly in the world of resources that Nova knows about.

This is where the abstraction breaks a little, like with large pages: you want a separate set of hosts where you place these VMs, because mixing workloads that don't care about performance with the ones that do is something we decided was tricky, unnecessary complexity that we didn't want to handle in the code. So we passed this off to the system operators and told them: if you want to provide this feature, you need to use a feature of Nova that allows you to separate your hosts into what's called aggregates, and use it to say that the VMs that go onto these hosts are only performance VMs, which will only ever be pinned to specific CPUs. There's obviously no overcommit in this case, because that would defeat the purpose of dedicated resources. The subtlety here is that it trades off maximizing hardware utilization against having high-performance VMs, which are the kind of more expensive VMs people want.

So again we had a simple API that says whether the VM's CPUs are shared or dedicated. Shared is the default, which is what Nova would do normally; dedicated has to be combined with a request that sends the VM to that specific set of hosts, and in that case Nova keeps track of which CPUs are dedicated to which VMs and can guarantee that the placement really is dedicated. It's very similar to the other features we implemented: the information is exposed and tracked by Nova, and placement decisions are made from it. The placement algorithm currently is not super smart; it basically takes the first fit. It does try to be a little smart in that it tries to utilize the CPUs of different NUMA nodes equally, but it doesn't do that in any globally optimal way, it's just in how we sort the available CPUs when we try to place VMs. So far we haven't seen any problems or complaints from users, so it seems to work. In theory it's not very optimal, because it goes through permutations of possible NUMA nodes, which starts to get slow if you have a lot of NUMA nodes, but we're not there yet, so we'll fix it when we get there. The result is the same as confining to a NUMA node, except that each vCPU is confined to a single host CPU, giving the VM dedicated CPUs, and Nova's tracking makes sure that you never share a CPU on that particular dedicated host.
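The operator-side setup looks roughly like this; the aggregate name and the "pinned" metadata key are just a common convention, not anything built in:

```
# Set aside hosts that are reserved for pinned, non-overcommitted VMs
openstack aggregate create --property pinned=true performance-hosts
openstack aggregate add host performance-hosts compute-03

# A flavor that asks for dedicated CPUs and is steered to that aggregate
# (this relies on the AggregateInstanceExtraSpecsFilter being enabled)
openstack flavor set m1.pinned \
  --property hw:cpu_policy=dedicated \
  --property aggregate_instance_extra_specs:pinned=true
```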
An additional feature we added on top of this: if you request a NUMA-aware instance and you also request a PCI passthrough device, it doesn't really do what you want unless the device does its IO on the same NUMA node the instance lands on, so awareness of that was added as well. This work was not done by Red Hat, but it's a good example of how a couple of companies collaborated on these kinds of requirements. The limitation here, the takeaway, or really the bug that we still have, is that you can't match individual devices to individual NUMA nodes. If you request a two-NUMA-node VM and two PCI devices, they may both end up affined to the same NUMA node, which may not be what you want; you might want them on separate NUMA nodes, but there's no way to express that.
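For reference, a passthrough request looks roughly like this; the alias name, the vendor and product IDs, and the exact configuration section names vary by release, so treat the details as illustrative:

```
# nova.conf on the compute hosts: which devices may be passed through,
# and an alias users can request by name
[pci]
passthrough_whitelist = { "vendor_id": "8086", "product_id": "10fb" }
alias = { "vendor_id": "8086", "product_id": "10fb", "name": "fastnic" }
```

```
# Two NUMA nodes and two devices - but there is no way to say
# "one device affined to each guest NUMA node"
openstack flavor set m1.nfv \
  --property hw:numa_nodes=2 \
  --property pci_passthrough:alias=fastnic:2
```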
So, what were the good parts? The good parts were that, at least in theory, but also from what we hear from customers, this enabled OpenStack to be used by a set of users that would otherwise not have much use for it, and they got involved in the planning and some of them in the development, and it was just generally a good thing for the community, as I so eloquently put it in the second bullet point.

But there were some problems. As we've seen with the large-pages and CPU-pinning examples, it complicates your operations, because now you have to know which hosts are which, and that kind of breaks the promise that you just throw hardware at it and scale out infinitely. It is also, as we've seen, tricky to expose these details in a reasonable way, for example with the NUMA nodes and PCI devices. And there are some specific challenges with the way this was implemented: because it's not used by a big chunk of users, it's off by default, which means it doesn't get all the testing we would like it to get. We're getting better at that, but it's just the reality of life. And once we started doing this, we found that a lot of the internals were simply never designed to handle these kinds of requests, as I tried to describe at the beginning, and that's where a lot of the development time was spent. But in the end we all learned a lot, or something, which is great.

For future plans: I put this under future plans, but it's really a bug that live migration won't work properly, so we should fix it. It's a bug for plain VMs that don't do any passthrough; we should definitely support that properly, and currently we don't: sometimes it might work, sometimes it might not. The thread-policy features were actually merged not long ago; they're not released yet, but they're merged, and that was also done by engineers from different companies, building on a lot of the work that Red Hat folks did, so it's definitely a good example of how this model worked well. And the device passthrough side I've already spoken about.

With that, I'll let you ask questions. There are three scarves here that I'm supposed to give to the best three questions, so give me at least three questions.

Yes, so there is support for SR-IOV; that's what I mean by PCI device support. In the general case it doesn't have to be SR-IOV, you can pass through non-SR-IOV devices for example, but the majority of users would probably use SR-IOV. There was a good reason for that, but I forgot it; I should have looked it up. This work was done about a year ago now, so I've forgotten those details, sorry about that. But thank you for asking; come get your scarf if you want.

No, in theory it could be, but it's not. As long as libvirt exposes this kind of information, which I think it does, we could probably do it on the Nova side, but we don't. Scarf. Actually, maybe there will be more questions, so let's see who gets the scarf.

Well, you can do this, as long as Nova is told that these devices are fair game; it's just not something we particularly focused on, and not something the community focuses on, because a lot of the community members who work on this come from the networking world. So it's not heavily tested, for example, but GPUs are something that we in theory support.

Right, so the question was: does Nova allow evacuation of these VMs? That is a good question, because Nova uses the word "evacuation" very liberally, so it can mean a number of things. In the simplest case it means that a host is down, or you shut it down because something is wrong with it, and you want to start its VMs somewhere else. There are two ways to do this: tell Nova where to put them, or let the scheduler decide again. If you let the scheduler decide, it will work; it will make the same decision it made when placing the VM. If you tell it where to go, that's not really tested; it should work, but if you were a customer I wouldn't tell you to rely on it 100%, we would have to test it.

OK, thanks a lot. If there are no more questions, the three people who asked questions can come and get their scarves, because all the questions were amazing. And yes, send me an email and I'll send you the PDF version of the presentation; it will be available online anyway.