So hopefully some of you were at the earlier talk that introduced the placement service, which is what we've been working on for at least 18 months or so. This set of slides goes a whole lot deeper: it's really about the internal queries that the placement service is making. So if you like SQL, you're going to love these slides. If you don't, sorry.

Okay, just a brief introduction. This morning I talked about how the placement service is addressing a number of messy areas, but I kind of skipped over some of the mess. So let's take a look at some of the issues that we're trying to address in code. Things were really easy when everything was just CPU and memory: we didn't have to deal with shared anything, and everything was a VM. Over time a bunch of stuff built up, and we've got a lot of crufty code sitting in Nova that the placement service is attempting to address.

Some of the inconsistencies in the resource tracking: you've got vCPU and RAM, and they seem relatively simple, but when you throw in things like NUMA and CPU pinning, you can throw that simplicity out the window. PCI devices have their own resource tracker entirely: there's a set of nova.pci sub-modules that handle PCI device resource management on the actual Nova compute node, so that's tracked entirely differently, in its own pci_devices table. Totally different. NUMA resources don't have their own tracker exactly, but they have a whole separate module, and they're stored in a completely different way in the various databases than other resources. And then there's Ironic; I'll just leave it at that.

Ironic nodes are essentially indivisible units of things, right? Normal compute nodes that hand out resources to VMs are cutting up the resources they have locally and passing them out to a bunch of different consumers. Ironic nodes are completely different: an instance that gets booted from Nova onto a bare metal Ironic node gets the whole thing. It doesn't just get some CPU from the bare metal node; it gets the whole node. So we have to view Ironic bare metal nodes as atomic, indivisible things.

All right, disk, everyone's favorite friend. The inconsistencies around disk reporting are many, and they're kind of infuriating. Depending on the type of disk, the hypervisor that you're using, and whether you're using shared storage or local disk, all these things affect the reporting, the tracking, and the varying degrees of accuracy of disk resource management. There's also disk_available_least. Chris is laughing; does anyone actually know what disk_available_least does? Yeah. It's totally different depending on what sort of image backend you use. You know what, I'm not even going to get into it. It's just so weird. Anyway.

Inventory and allocation data. If you were at the talk this morning, I introduced the concept of inventory being the capacity of a particular resource on a provider. A resource provider is providing some total amount of vCPU, of memory, of SR-IOV VFs, whatever it is. An inventory is the capacity.
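(For reference while reading the rest of this, here's a rough sketch of the two core placement tables. This is from memory of the Ocata-era schema, so treat the exact column list as illustrative rather than authoritative.)

```sql
-- Rough sketch of the core placement tables (Ocata-era layout, from memory).
-- An inventory row is capacity: how much of one resource class one provider has.
CREATE TABLE inventories (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    resource_provider_id INT NOT NULL,   -- FK to resource_providers.id
    resource_class_id INT NOT NULL,      -- e.g. VCPU, MEMORY_MB, DISK_GB
    total INT NOT NULL,                  -- raw capacity
    reserved INT NOT NULL,               -- held back, never handed out
    min_unit INT NOT NULL,               -- smallest allowed single request
    max_unit INT NOT NULL,               -- largest allowed single request
    step_size INT NOT NULL,              -- request granularity ("quantization")
    allocation_ratio FLOAT NOT NULL      -- overcommit multiplier
);

-- An allocation row is usage: how much one consumer (an instance, say) has
-- consumed of one resource class from one provider.
CREATE TABLE allocations (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    resource_provider_id INT NOT NULL,
    consumer_id VARCHAR(36) NOT NULL,    -- UUID of whatever consumed the resource
    resource_class_id INT NOT NULL,
    used INT NOT NULL
);
```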
Allocation data, which is usage data, is spread all over the Nova database: in the compute_nodes table, the instances table, the pci_devices table, the instance_extra table. Inventory and allocation information is stored all over the place. Some of it is just a field in a table. Some of it is records in a table, which we'll show you in a second. Some of it is JSON blobs that are serialized and shoved into the database. All this inconsistency results in a whole lot of mess inside of Nova, especially in the nova.compute.resource_tracker module and the virt layer itself. There are a lot of conditionals, a lot of switching: oh, is it a PCI device? Go ask the PCI device manager to do X, Y, Z. Is it memory? Okay, cool. Is it NUMA? Then, oh gosh...

So, an illustration of this problem in code. This question should generally be pretty easy to answer; it's one of the most common questions that people, or the cloud management platform, would have: how many resources are available in my cloud? You'd think there'd be a simple way to get this information. Well, there isn't. The formula is pretty easy: the available amount is the inventory minus the sum of the usage. That's how it should be.

So how do we figure out the total amount of RAM on a host? That one's relatively easy: we select memory_mb minus the reserved host memory from the compute_nodes table where the host is some identifier. But keep in mind, as our little stormtrooper guy says, the compute_nodes table is in the cell DB. So if I want the total amount of RAM available in my whole deployment, I need to go to each cell DB and look at the compute_nodes table in each of those cell DBs.

Total amount of disk space on a host. Easy? Well, I already told you that disk is the bastard child of resources. Supposedly you can do select local_gb from compute_nodes where host equals host. But does the host know about shared storage? No, it really doesn't. /var/lib/nova/instances, where the disks for instances are stored: nova-compute doesn't necessarily know whether that's an NFS share or some local disk. So local_gb doesn't actually mean local gigabytes. It's not necessarily local; it could be, or it could be shared. So again, bastard child of resources.

How do we get the host's NUMA-affined memory? This is a fun one. We grab the numa_topology field from compute_nodes where host equals host, marshal it into an object from the database, and then in that object there's a set of cells, and we sum over cell.memory to get the total amount of memory affined on the host. So that's fun. (By the way, Windows is totally messing up the formatting of these slides, but that's right.)

All right. How do we get the total number of SR-IOV virtual functions on a host? This one is actually reasonable: just count the number of records in the pci_devices table that are associated with that compute node and have a device type of virtual function. How do we reserve VFs? Modify the pci_passthrough_whitelist configuration option. Awful. Has anyone had the fun of modifying pci_passthrough_whitelist? Yeah. What is that like on PowerPC? Even worse. Right. So yeah, you'd love it if all this stuff were consistent, but no, it's not. All right.
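(Pulling those together into rough SQL: the table and column names are Nova's real ones, but the predicates are sketches, and the reserved amount actually comes from nova.conf, not from the database.)

```sql
-- Total RAM "available" on a host. Runs against each cell DB separately;
-- :reserved_host_memory_mb is a nova.conf option, not a column.
SELECT memory_mb - :reserved_host_memory_mb
  FROM compute_nodes
 WHERE host = :host;

-- "Local" disk on a host, which may in fact be an NFS share; local_gb
-- cannot tell you either way.
SELECT local_gb
  FROM compute_nodes
 WHERE host = :host;

-- NUMA-affined memory is not really queryable in SQL at all: numa_topology
-- is a serialized JSON blob, so the summing happens in Python after fetching.
SELECT numa_topology
  FROM compute_nodes
 WHERE host = :host;

-- Total SR-IOV virtual functions on a host: one pci_devices row per device.
SELECT COUNT(*)
  FROM pci_devices
 WHERE compute_node_id = :node_id
   AND dev_type = 'type-VF';
```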
So that was the inventory information. Let's see about getting usage information, the stuff that's allocated to users. Total amount of RAM used on a host: you'd think we could just go and get this, because there is a field called memory_mb_used on the compute_nodes table. You'd think we could just select memory_mb_used, and... it's wrong. So instead, what we have to do is grab all the records from the instance_extra table for instances associated with a compute node, grab the flavor object, which is a serialized JSON blob, deserialize it, and then sum up flavor.memory_mb. That gives us the real amount. So that's fine.

Disk space used on a host: you'd think it would be almost the same process, and it is: take instance_extra, take the flavor, and there's a root_gb, an ephemeral_gb, and a swap_mb on the flavor; you sum all that up after deserializing into objects. But yeah, you get anywhere from relatively inaccurate to really weird answers depending on your setup. You're the expert on BDMs. Should be BDSM.

The amount of RAM used for NUMA cells: the previous slide showed you how to get the total amount of memory associated with NUMA cells on a host. Same thing here, but notice we're using the exact same field, numa_topology, from compute_nodes. So we actually serialize both the inventory and the allocation, the usage stuff, all in one object and write it back to the database. This makes me a sad panda as well.

Total number of SR-IOV VFs used on a host? Pretty easy: we just add an "and instance_uuid is not null" to the end of the inventory query, and that gives us the number of VFs consumed on a particular host. And you'd think, well, Ironic nodes, we can probably get PCI information about Ironic nodes the same way, right? No, of course not. I find your assumption of consistency disturbing.

Okay, so now we're going to get into the good stuff, the SQL stuff. Good for me, because I like SQL. This is all about the placement queries, and in this section of the slides I'm going to try to lead you through building up the SQL expressions that return the resource providers that we're looking for from the scheduler.

If you were at the talk this morning, you may remember that flow chart of the different components in Nova and how they interact with each other. In that, you may remember that the scheduler in Nova now makes a call to the placement service saying: give me the list of resource providers that have capacity for a set of requested resource amounts, and traits. When we're attempting to do that query, we have to do two types of queries. One is the quantitative part, and for that we look at the allocations table (it's on the lower left there, with inventories right next to it). When we're comparing quantitative things, these resources, we need to sum up all the usage in the allocations table and compare that to the total amount of inventory for that particular resource class. So if a request comes in saying, okay, I need four vCPUs and a gig of memory, go find me the resource providers that have that capacity, the inventories and allocations tables are the ones we're going to hit to determine which providers have it.
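(And the usage side of those same resources in rough SQL, before we get to the placement queries; note how much of it can't be done in SQL at all.)

```sql
-- The tempting-but-wrong answer: the column exists, but don't trust it.
SELECT memory_mb_used
  FROM compute_nodes
 WHERE host = :host;

-- The real answer: fetch the serialized flavor blob for each instance on the
-- node, then deserialize and sum flavor.memory_mb (and for disk: root_gb +
-- ephemeral_gb + swap_mb) in Python, not in SQL.
SELECT ie.flavor
  FROM instance_extra ie
  JOIN instances i ON i.uuid = ie.instance_uuid
 WHERE i.host = :host
   AND i.deleted = 0;

-- SR-IOV VFs used on a host: the inventory query plus one predicate.
SELECT COUNT(*)
  FROM pci_devices
 WHERE compute_node_id = :node_id
   AND dev_type = 'type-VF'
   AND instance_uuid IS NOT NULL;
```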
So this is the first, simple part of the query. There are three sections to it. In the top one, we're joining to inventories. Then we left join to what's called a derived table, which is a subquery in the FROM clause; that's where we're aggregating, summing up the usage for that particular resource class. And the bottom blue section is our WHERE clause, which is what lets us filter out the resource providers that don't have the inventory.

There are three main parts to that WHERE clause. The first is just: the sum of the usage, plus what we're requesting, must be less than or equal to our capacity, which is total minus reserved, times the overcommit, the allocation ratio. So that's one level. Then we have these things called min_unit and max_unit in the inventory record. Those basically say: I only want you to be able to request a minimum of, say, 128 MB of memory, and a maximum of 256 GB. We store that information for each resource class on the provider. Then we have this step_size thing; in Kubernetes they call it quantization, which I think is really cool. Yeah, we can change the field name. Anyway, it makes sure that if, for instance, I set my step size to 64 MB for my MEMORY_MB resource class, I don't have a client that goes, hey, can I get 65 MB of RAM? Requests have to come in steps of 64 MB. And then we're grouping by the resource provider.

(Did I reverse them? "One of the min units is greater than what's being requested." Okay, sorry... no, I didn't. Okay, good job. Which t-shirt? Well, they're both XLs. Okay, I don't care though. Stormtrooper or AT-AT? Everybody give David a hand. Although Chris did help with the... yeah, okay. Nice spot. Make fun of Jay.)

All right, so that's the capacity query: get me the providers that meet the requested amount for a particular resource class. Now an easy one. Traits. Traits are basically just strings, and a trait indicates that a resource provider has, or is capable of, a particular thing. This query produces the set of resource providers that have all of the traits in the IN expression there.

So again, we're building up, where each of these SQL expressions just produces a set. When we're doing these placement queries, we're doing either intersections or unions of these sets. So when you're trying to read the placement code and figure out how exactly a SQL query works, all we're doing is getting different sets of information and joining them, either as an intersection or as a union, depending on what the query constraints are, as we'll see going forward.

(Ah, weirdly, there's actually stuff behind this on the slide, but I can't see it. Oh, look, there they are. I didn't realize I put a transition on the slide. All right, that's awesome.)

Okay, so what the little stormtrooper guy is saying is: we want to find providers that have all these traits and that have the capacity for a particular resource. That's awesome. Because it's an AND, we're doing an intersection, and to do an intersection, you just do a join. So we take the query that does the trait stuff, which is in the middle-ish there, and we just join it to the result of the other set, the one that was winnowing us down to the set of providers that meet the capacity for that resource class. (In answer to a question from the audience: no, it's not that expensive.)
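(Here's my reconstruction of those two pieces in runnable form. The real placement code generates this via SQLAlchemy, so take the exact spelling as a sketch; :rc_id is the requested resource class and :amount the requested amount.)

```sql
-- Capacity: providers that can fit :amount of resource class :rc_id.
SELECT rp.id
  FROM resource_providers rp
  JOIN inventories inv
    ON inv.resource_provider_id = rp.id
   AND inv.resource_class_id = :rc_id
  LEFT JOIN (
        -- derived table: summed usage per provider for this resource class
        SELECT resource_provider_id, SUM(used) AS used
          FROM allocations
         WHERE resource_class_id = :rc_id
         GROUP BY resource_provider_id
       ) usages
    ON usages.resource_provider_id = rp.id
 WHERE COALESCE(usages.used, 0) + :amount
         <= (inv.total - inv.reserved) * inv.allocation_ratio  -- capacity
   AND inv.min_unit <= :amount        -- no requests below min_unit
   AND inv.max_unit >= :amount        -- no requests above max_unit
   AND :amount % inv.step_size = 0    -- whole steps only ("quantization")
 GROUP BY rp.id;

-- Traits: providers having ALL of the requested traits. The HAVING count
-- equals the number of names in the IN list, turning the IN into an AND.
SELECT rpt.resource_provider_id
  FROM resource_provider_traits rpt
  JOIN traits t ON t.id = rpt.trait_id
 WHERE t.name IN ('HW_CPU_X86_AVX2', 'STORAGE_DISK_SSD')
 GROUP BY rpt.resource_provider_id
HAVING COUNT(*) = 2;
```

Joining those two SELECTs to each other, each used as a derived table, is the intersection.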
Okay, so that was the easy part of the SQL. Now, shared resource providers. This is when you've got a resource provider that says: all right, I have an inventory of disk, and I am going to make that disk resource available to other resource providers to use, via an aggregate association. The canonical example is storage pools, but we can also model routed network segments, IP allocation pools, and some other things.

So how do we get the resource providers that are sharing a resource? It's actually very similar to the previous query; we're just joining on the traits table with a very special trait called MISC_SHARES_VIA_AGGREGATE. This reduces us down to the set of providers that are sharing a particular resource. That's all we want here: just the set of providers sharing a particular resource. We're not getting the set of providers that are sharing a particular resource with someone; we're just getting the set of resource providers that share something. Important. (I guess I have a transition on this one too.)

All right. Unlike the AND expression before, here we want to find the providers that either have all the resources locally, or have some of the resources and are associated with something that's sharing those resources with them. So it's an OR expression, which is a union. Oh my god. What we need to do is a LEFT JOIN, which gives us all of the rows from the left side of the join and nulls out the columns on the right side for anything that doesn't match.

Okay. So we start building up the SQL. (I'm sure the greater-than-or-equals are still wrong.) What we're doing here is left joining the results of "I have this locally" with what I call a butterfly join: the set of sharing providers associated via the resource_provider_aggregates mapping table. What that gives us is all of the resource providers that have local inventory for something, left joined to the set of resource providers that share that resource. And in the WHERE expression at the bottom, you'll see there's an OR: it says WHERE, then that same block of capacity conditions, OR the sharing provider is not null. That's what gives us the union.

Now, that's the end of the fun SQL, but that's just for one resource class. If I request four vCPUs, a gig of RAM, and 500 GB of disk, I'm going to be joining all of these subsets one after the other. That's what the placement code does: it generates these sets of information and keeps joining them in one giant query. And it's actually pretty efficient.

Good question. So on the bottom here, this is just saying: hey, I'm requesting an amount; get me all the resource providers that say they're sharing and that also have that amount. That essentially gets reduced into that sharing-provider set, the IDs in red on the right. We reduce that to simply a set of IDs, and that goes into the last blue line: the IN sharing_provider_set there, that whole sharing-provider set, is the result of that previous query. And again, we simply build up the query one resource class at a time. So that's just the one resource class.
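(A sketch of those two steps: MISC_SHARES_VIA_AGGREGATE is the real trait name, but the join shape here is from memory and simplified; the stand-in predicate marks where the full capacity conditions from the previous sketch get repeated.)

```sql
-- Step 1: the set of providers that share *something* via an aggregate.
SELECT rpt.resource_provider_id
  FROM resource_provider_traits rpt
  JOIN traits t ON t.id = rpt.trait_id
 WHERE t.name = 'MISC_SHARES_VIA_AGGREGATE';

-- Step 2: the union. Providers that have the inventory locally, OR that are
-- associated via an aggregate with a provider sharing that inventory.
SELECT rp.id
  FROM resource_providers rp
  LEFT JOIN inventories inv
    ON inv.resource_provider_id = rp.id
   AND inv.resource_class_id = :rc_id
  LEFT JOIN resource_provider_aggregates rpa
    ON rpa.resource_provider_id = rp.id
  LEFT JOIN resource_provider_aggregates shared
    ON shared.aggregate_id = rpa.aggregate_id
   AND shared.resource_provider_id IN (:sharing_provider_set)  -- step 1's IDs
 WHERE inv.total IS NOT NULL        -- stand-in for the full capacity checks
    OR shared.resource_provider_id IS NOT NULL
 GROUP BY rp.id;
```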
Okay, so moving on. One of the things we'd like to do to make the scheduler and placement interactions a little more efficient is to move the place that claims resources from the compute node to the scheduler. Or... we haven't quite decided exactly where yet.

In my previous slides, you'll notice on the top red line there, right now we make a call from the scheduler to the placement service and it returns a list of resource providers that match a particular query. At that point, the scheduler returns back to the conductor, and the conductor sends the launch request down to the compute node, and the compute node is what writes the allocation, what actually consumes the resources on the local node. What unfortunately happens is that when a claim is performed on the local compute node, quite a bit of time can pass between the initial scheduling decision, going through the message bus, landing on the compute node, doing some stuff, and then writing the claim by consuming the resources in the local cell database. Because that period of time is fairly lengthy, another scheduler process could have chosen that same compute node, sent a launch request down to it, and another instance could have grabbed that set of resources.

Very frequent. Yeah, this happens a lot more than you might think, especially in scenarios where you have a packing scheduler, where you don't want to spread the load out over all your compute nodes but instead want to pack the compute nodes up. What happens then is that the last little slot of resources on each compute node gets a lot of contention: you'll have multiple schedulers all picking that compute node, all competing for that last spot. That retry mechanism is very expensive in the system as a whole, and we'd like to reduce both that window of time and the expense of retries. We want to do it for efficiency reasons, and we also want to do it because of the cells V2 architecture: the retry actually causes what we call an up-call from the cell up into the superconductor.

So, hold on a second... ah, okay, sorry, I forgot a little thing. In Newton, the compute node was not required to write information about the allocation to the placement service; it was only claiming locally in the cell DB. In Ocata, we now require that the compute node calls placement when the instance gets launched on it and informs the placement service: hey, I'm allocating against this particular resource provider. But we still haven't solved any of the retry issues. In Pike, we're going to do the claim in the scheduler, the claim being a POST of allocations: that consumes the resources from one or more resource providers for a particular request. That should eliminate 95% or so of the retry causes.

All right, so let's talk a little bit about nested resource providers. Sometimes there's a natural parent-child relationship between various resource providers. Say you have a compute node and it's got a couple of Xeon processors, two NUMA nodes. It's a natural parent-child relationship to have the compute node be a parent resource provider, and the NUMA nodes underneath it be child providers of vCPU, memory, and other NUMA-affined resources. You can have multiple layers of nesting in this hierarchy. In the bottom graph here, say you've got two SR-IOV physical functions, each affined to a different NUMA node. The natural relationship would be: the compute node is the top-level parent, the root provider; the NUMA nodes underneath it are child providers; and the physical functions underneath those are children of the NUMA nodes.
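(Since this was still future work when I gave this talk, here's a purely hypothetical sketch of how that hierarchy could look in the database; parent_provider_id is my assumed column name, not a shipped schema.)

```sql
-- Hypothetical: a self-referencing parent pointer on resource_providers.
-- Compute node -> NUMA nodes -> physical functions, each its own provider.
INSERT INTO resource_providers (id, uuid, name, parent_provider_id) VALUES
    (1, UUID(), 'cn1',           NULL),  -- root provider: the compute node
    (2, UUID(), 'cn1_numa0',     1),     -- child: NUMA node 0
    (3, UUID(), 'cn1_numa1',     1),     -- child: NUMA node 1
    (4, UUID(), 'cn1_numa0_pf0', 2),     -- grandchild: PF affined to NUMA 0
    (5, UUID(), 'cn1_numa1_pf1', 3);     -- grandchild: PF affined to NUMA 1
```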
That parent-child relationship represents the affinity of the physical function to the NUMA node. We can even go one level further: say we wanted a bunch of VFs, a bunch of virtual functions, on each physical function, and we wanted to associate different traits with each virtual function. We would set up yet another layer of child providers underneath the physical functions and associate traits with the different virtual functions. So anyway, luckily we are limiting things to a single parent and not going totally crazy-pants here.

This one's a little difficult to read. What I'm showing here is two compute nodes, each with two physical functions. On the left side, each of those physical functions has an inventory of virtual functions. Can you guys see that? All good there. On the right side, there are two physical functions, but one of them is set up as a pass-through, meaning that the top-level compute node provider has an inventory of resource class SRIOV_NET_PF. That signifies that the physical function itself is available as a resource to the guest, meaning it's a pass-through. Does that kind of make sense? The thing that you decorate with a trait, and that has an inventory record of some resource class, is the resource provider. But in the case... gosh, I'm not really explaining this very well. In the case where a physical function is a pass-through, we're still decorating the resource provider record with a trait, but the inventory is different: it's a PF, not a VF. Anyway, this is the stuff that we're going to start working on in a few weeks.

No, resources don't have traits applied to them. Resource classes are just a description of a thing. A PF and a VF are different things that you can hand out, right? And a trait is more of a capability of some provider; it's not a description of the resource itself. If you wanted to differentiate between resources, you'd make different resource classes: you'd have purple foo and orange foo, those are resource classes, and your provider would have an amount of one of those things. But if I want to indicate that a resource provider has a particular capability, I associate a trait with the provider of those resources. So the quantitative side is cleanly separated from the qualitative in that way.

No, there's an inventory record for each resource class on a resource provider. So yeah, the resource class SRIOV_NET_VF refers to a virtual function. It doesn't refer to a virtual function with this set of capabilities. If you wanted that, you'd make a resource provider for the VF, tag that VF provider with a particular trait, and that VF resource provider would have an inventory, also of SRIOV_NET_VF. It gets really complicated. No: this resource provider hands out these types of things; this resource provider is decorated with these properties. The thing that it's passing out, doling out for consumption, isn't decorated with anything. The thing is the resource class.
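(Continuing the hypothetical sketch above, to make that split concrete: both PF providers hand out the same resource class, and only the provider gets decorated. CUSTOM_FAST_VF is a made-up trait name for illustration.)

```sql
-- Quantitative side: both PF providers carry an inventory of the SAME
-- resource class, say 8 VFs each (:rc_sriov_net_vf = SRIOV_NET_VF's id).
INSERT INTO inventories (resource_provider_id, resource_class_id, total,
                         reserved, min_unit, max_unit, step_size, allocation_ratio)
VALUES (4, :rc_sriov_net_vf, 8, 0, 1, 8, 1, 1.0),
       (5, :rc_sriov_net_vf, 8, 0, 1, 8, 1, 1.0);

-- Qualitative side: the capability is a trait decorating ONE provider.
-- The VFs being doled out are never decorated; only the provider is.
INSERT INTO traits (id, name) VALUES (100, 'CUSTOM_FAST_VF');
INSERT INTO resource_provider_traits (resource_provider_id, trait_id)
VALUES (4, 100);
```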
All right, cool. So there's one prize left, and only people in XL shirts may answer. Okay: which component determines the NUMA cell that the guest is placed on? First one to answer gets this. It's nova-compute. It's not the scheduler. People think that the scheduler filter called the NUMA topology filter actually determines which NUMA cell a particular workload is placed on. It does, but then it immediately throws away that information. That's why I said it was a trick question.

So yeah, the NUMA topology filter and the PCI device pass-through filter determine whether a particular compute node has the capacity to do something. And in fact, the NUMA topology filter runs hardware.numa_fit_instance_to_host(), I think that's the name of the function, and it determines the exact NUMA cell that the workload should go on. And then it immediately forgets it, sends the request down to the compute node, and the compute node re-runs fit-instance-to-host; that's what actually gets saved into the database and assigns the instance to a particular NUMA cell. So, lovely. Good job, Tims.

So that's my deep dive into placement queries and resource tracking. Thank you, and does anybody have any questions?

Hey, I just wanted to say kudos for doing all of your processing in SQL. Earlier today I was in a session where somebody ran a MySQL command, select star from some table, and then piped it to wc -l, and the answer was four million something. So you can imagine all of the extra data and crap that was coming over the wire. So kudos for that.

Well, thanks. Those of you who work with me know that I kind of dream and think in SQL, unfortunately. It's an OCD thing.

We didn't talk about overhead. I didn't talk about overhead; Matt wants me to talk about overhead. Overhead: would that be the other bastard child of resources? It should have been in the list of terrible things. Yeah. So, overhead. First of all, does anyone know what overhead is? Not you. All right, what is it?

Kind of cheating, because you talked about it earlier, but: if I ask for two gigs of memory, my VM gets two gigs of memory, but it actually consumes 2.1 gigs or something. You don't get charged for it. And it may be a percentage of the amount I request, it may be a fixed amount per VM, it may depend on the host, it may depend on any number of things, so you can't predict it ahead of time.

Any number of things, and it's hypervisor dependent. Each hypervisor has this estimate_instance_overhead function in the virt driver API, and it calculates the resources an instance incurs that we're not going to charge the user for and that, in a little hidden way, get consumed on the host. Basically it's there to ensure that hosts don't overcommit more than they've already said they're going to overcommit. And again, it's hypervisor dependent. So Xen... is John here? John Garbutt? All right. So yeah, XenServer was the original one to add instance overhead, and it basically looks at the amount of disk space requested and then adds memory: it's a percentage of the requested disk space, converted into some memory footprint on the dom0 or something. Anyway, that's crazy. Hyper-V has its own thing that's disk and CPU related... disk and memory related, sorry. Libvirt has the CPU thing with the emulator threads; that's fun. And I don't know what the VMware driver does. Doesn't have one? And yeah, obviously, Ironic doesn't have one. Yay.

Okay, anybody, any more questions?

Hello. Let's say you have a shared storage scenario and you want to do a migration, or live migration, of an instance. Can the placement API prioritize compute nodes that are on the same shared storage, so basically it will be more efficient?

Can it? Can it now? No.
So, affinity and anti-affinity: we really just started the conversation on what we'd like the user experience to look like going forward. Matt's going to put together a spec that covers that user experience, and then we'll figure out how to add what we need on the placement service side to make the aggregates aware of the distance between each other, which is what represents affinity. So maybe Rocky, to be honest. Sorry, Matt... maybe Pike. Matt?

Sorry, just thinking, going back to the extended discussion we had about distances between things: that's actually an excellent example of there being more than one answer to the question, what is the distance between these two things? (What is the meaning of life?) It could be a network distance: if you're connecting to similar storage, for example, which is where the discussion originally came from, that's going to be a network distance. But this is "connected to the same shared storage," which is membership of an aggregate. So it's a completely different distance.

Well, it depends, right? If you want to live migrate and you want to keep the instance near something, you just look at all the compute nodes associated with the same aggregate as the storage pool the source host is on. We say: the storage pool associated with the source host belongs to aggregate X, which is providing a shared pool of storage resources. When we migrate, we just say, hey, find me another compute node that's associated with that aggregate and isn't the same as the source. Sure, that's a distance of one, I guess.

Dims? Kind of related? Yeah, that's a good question. The question, if you didn't hear it, was: how far away are we from a unified scheduler? I actually don't think we're ever going to have a unified scheduler, and when I say unified scheduler, I'm talking about something that can handle sorting, weighing, and all sorts of crazy-pants filtering. What I do envision is that by the end of Rocky the placement service will be tracking and managing inventories and allocations for ninety-something percent of the types of resources we're concerned about, including nested resource providers, shared storage, and the relationships between things, affinity and anti-affinity. I think that will be feature complete.

And Cinder can be using it: when it creates a volume pool, it can create an aggregate, write that inventory information into placement, and we can integrate with that. Same with routed networks in Neutron land. Ironic can use the placement API directly as a scheduler, because really it's just picking the things that have capacity. But what I don't envision any time soon, or maybe ever, is moving the weighers, the things that sort that list. Placement is responsible for giving you the list of providers that meet the requirements; it's not responsible for sorting that list and picking one. That's the scheduler, right? So I don't actually envision the Nova scheduler piece ever being broken out into something separate. I see the Cinder scheduler, and the Neutron router agent scheduler, whatever, being able to query the placement API to reduce their burden for these types of queries, but they'll still need to sort based on crazy-pants stuff,
because that's all so deployment dependent. Yeah, I know. Thank you. Thanks, guys.