All right, so today I'm giving a couple of talks about the new placement service in OpenStack. This first talk is an overview for architects and deployers about what the placement service is, what problems we're trying to fix, and where the solutions are headed. Just a little bit about me: I'm Jay, and I've been involved in the OpenStack community for way too long now.

Let's see, the script for today. We're going to talk about what exactly the problems are that we're trying to fix with the placement service and its integration with other Nova components. We're going to take a look at the new models, the new database models or data structures that we use in placement. And we're going to talk a little bit about the roadmap: what we got done in Pike so far and what's on the roadmap beyond that.

So what is the mess that we are currently trying to fix with the placement service? There are four major problems that we are slowly trying to address, and it all really comes down to technical debt that has accumulated inside of Nova over the last seven or so years.

One of the biggest problems we've addressed first is that each type of resource in the system tends to be tracked, managed, and represented differently. That means a whole lot of spaghetti code gets introduced into the system: a lot of conditionals, a lot of places where you read through the code and literally just shake your head. We have some resource types, resource classes, like disk, that are surprisingly inaccurately reported, and I'll get to an example of how that looks in a second. So we're trying to make the reporting of usage and capacity much more accurate in the system. We've been coupling together things that describe quantities with things that describe qualities (attributes), and that has led to quite a bit of mess in the code as well. In addition to coupling quantitative and qualitative stuff together, we've also coupled inventory (capacity-type) information with usage information for some types of resources.

All of those problems have led to duct-tape-and-chicken-wire solutions all over Nova. Some of the areas that are most messy, and I shouldn't necessarily say messy, some of them are definitely messy, but it's more the inconsistency that is the issue: NUMA, the PCI device handling, Ironic and atomic (indivisible) resources, and differences between virt drivers. This is a real pain: libvirt and Hyper-V and VMware all track resources slightly differently. The quota system is a real mess, partly because it represents resources in a completely different way than everything else in Nova. Custom filters and weighers: people have been attempting to get around the constraints of the inconsistencies inside Nova by creating custom weighers and filters in the scheduler, and all of that custom stuff creates quite a bit of technical debt that companies have to carry over the long term, maintain, and continually rebase while we make sure we're not breaking them. Server groups are another crazy area, as well as flavors.

So flavors, a really big issue for me. We are trying to clean up the concept of a flavor. One of the problems of the flavor object is the inconsistencies that are jammed into the flavor representation. Just a small example: for disk usage, when you're requesting some amount of disk, the flavor has a root GB, an ephemeral GB, and a swap MB. For some reason, I don't know why, the swap is in megabytes and everything else is in gigabytes.
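To make that unit mismatch concrete, here is a small sketch. The field names follow Nova's flavor model; the normalization helper is purely illustrative, not Nova code.

```python
# A Nova flavor carries its disk sizing in mismatched units:
# root_gb and ephemeral_gb are gigabytes, swap is megabytes.
flavor = {
    "vcpus": 4,
    "ram": 1024,         # MB
    "root_gb": 20,       # GB
    "ephemeral_gb": 80,  # GB
    "swap": 512,         # MB, the odd one out
}

def total_disk_mb(flavor):
    """Illustrative helper: normalize all three disk fields to MB."""
    return (flavor["root_gb"] + flavor["ephemeral_gb"]) * 1024 + flavor["swap"]

print(total_disk_mb(flavor))  # 102912
```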
Boot from volume: if you're booting from volume, the amount of root disk that you're consuming should be zero. So there's a bunch of inconsistencies and conditional code throughout the flavor handling for dealing with things like boot from volume. Ironic: the flavor is kind of useless there. When you launch an Ironic instance, you're not getting four vCPUs and a bunch of disk space and some memory, you're getting that Ironic node. You're getting an indivisible unit. We jam these capabilities into extra specs, and that tends to be a kind of wild-west, free-form thing. They're not really standardized, and it's a mess. We have things like receive and transmit factors, which, who knows if they actually work. I don't think so.

All right, so, first pop quiz. And we've got t-shirts and Darth Vader USB keys. Who can tell me, other than the flavor, what are two ways that you can influence the scheduler's decision making? Image? Yep, image metadata. What's another one? Dan, you're a Nova core, come on. Other than image metadata: scheduler hints. All right, what's another one? I put a little hint in there, but you probably can't read it. Well, that, yeah, that's one. Also, PCI devices, right? If you set up a port in Neutron and it's decorated with a bunch of port binding information, that gets created as a PCI device request, and that influences the scheduling decisions. OK, so you get a prize. Here, come on, choose your shirt or your USB key, whatever you'd like.

All right, so that's a little bit about what we're trying to address, some of the mess. Here's the new direction that we've been taking with resource tracking: the placement service. A couple of big things about it. It is a really, really lightweight REST API, essentially. It doesn't have any RPC or messaging stuff in it. It's usable with Apache or nginx and all that kind of stuff, without any eventlet mess. And importantly, it has a global view of resources. In the new cells v2 system (I'll get to a little diagram of how this all works), you've got cells that comprise a subset of compute resources within Nova. Each of those cells has a Nova database with information about the resources within that cell. The placement API has a view of the entire deployment. And it's also not specific to just Nova, as we'll see in a second.

Just a little background on inter-component traffic. You've got the Nova API, which passes requests and information to, we'll call it the scheduler. It's a little more complicated than this, but it's easier to just put one little box for the scheduler on the chart. The scheduler is going to communicate with the placement service and then pass information on to the Nova conductor within a cell after it picks a direction to send things. The compute nodes are going to be updating the placement API with information about inventory, capacity, allocations, that kind of thing. We'll get more into this in a second.

So how do you deploy placement? We have spent quite a bit of effort to make the deployment of placement as simple as possible. Generally, there's a separate package for the placement service. Even though the placement code currently lives in the Nova tree, there's a separate package, and it's fairly decoupled from everything else in Nova. Once you've deployed that package, there are a couple of things that you need to do, and this is for greenfield.
You need to create the placement endpoints and service records in Keystone, which you have to do for any new service that you add into the mix, and make sure you run the API database sync to bring your database schema up to snuff. You also need to ensure that all of the options in the [placement] section of your nova.conf are set properly. These aren't onerous config options, and there's really nothing in there that you need to go look at and tune. It's really just a description of where to find the Keystone service catalog and authentication information. So it's pretty self-explanatory.

So how do you upgrade from Newton? The one big difference is that in Newton, if you restarted your compute nodes and they couldn't connect to the placement service, or couldn't find it or whatever, it would be OK with that. In Newton, I'm sorry, in Ocata, when you restart your compute nodes, if the placement information is either missing from the nova.conf or incorrect or invalid, the nova-compute service will fail to restart. So like I said, you'll want to make sure that you're putting all the information in the config variables in the [placement] section correctly; restart it and make sure it's going to stay up and connect to the placement service. You'll see that in the log lines. We also have this new health-check-y, what do we call it, pre-flight check tool called nova-status. If you do a nova-status upgrade check, part of the checks that it runs makes sure that the placement service is being populated correctly and all that kind of stuff. So there are some good docs there for how to run those pre-flight checks if you're doing an upgrade.

All right, so let's take a look at the new data model that the placement service provides. We have a number of major models: resource class, trait, resource provider, inventory, allocation, and aggregate.

Resource classes: these are things that can be counted. They only have integer amounts, so you can't have half a CPU or 2.5 megs of RAM. There are standard resource classes, which are really anything listed in the Nova source tree, and you can do a GET /resource_classes against the placement API to see which resource classes are available. Anything that has the prefix CUSTOM_ is a custom resource class. An example of custom resource classes is in Ironic. I can't remember the release where Ironic added the node resource class. Ocata? Newton. Okay, in Newton, when you do a get on Ironic's nodes, the response includes a little resource_class field, and that indicates the custom resource class representing that class of hardware in Ironic. So that's an example of a custom resource class. They can be defined by cloud admins, and the custom resource classes that we're currently working on integrating are the Ironic ones.

So, traits: these are things that describe the provider of a resource. They don't describe the resource itself; it's a little bit different. We have an os-traits library, which lists all the standard traits, things like x86 CPU instruction set extensions and things like whether or not a storage drive is SSD. There's a whole bunch of them in there. Anything that is prefixed with CUSTOM_ is a custom trait, so cloud admins can create traits and associate them with resource providers. Flavors will have a set of required traits, and when you launch an instance with that flavor, it'll go and ask the scheduler: hey, find me a bunch of resource providers that have these traits associated with them.
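Rewinding to that nova.conf step for a moment, here is a hedged sketch of what the [placement] section might look like. The option names reflect the Ocata-era defaults as I recall them, and the values are placeholders, so check the config reference for your release.

```ini
# nova.conf (sketch; values are placeholders)
[placement]
os_region_name = RegionOne
auth_type = password
auth_url = http://controller:5000/v3
project_name = service
project_domain_name = Default
username = placement
user_domain_name = Default
password = REDACTED
```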
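And to make the trait naming convention concrete, a minimal sketch. The two standard names below do exist in os-traits; the custom trait and the helper function are made up for illustration.

```python
# Trait names are plain uppercase strings. Standard ones ship in the
# os-traits library; admin-defined ones must carry the CUSTOM_ prefix.
HW_CPU_X86_AVX2 = "HW_CPU_X86_AVX2"    # a CPU instruction set extension
STORAGE_DISK_SSD = "STORAGE_DISK_SSD"  # the provider's disk is SSD-backed

def is_custom(trait: str) -> bool:
    """Illustrative helper: custom traits are namespaced by convention."""
    return trait.startswith("CUSTOM_")

print(is_custom("CUSTOM_GOLD_TIER"))  # True
print(is_custom(STORAGE_DISK_SSD))    # False
```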
We haven't fully wired up all of this trait-and-flavor stuff yet, so this is kind of a proposal of what I'd like to see it look like, but anyway. Resource providers: the thing to remember is that they are generic. They're not specific to compute or Nova. Really, anything that provides some resource is a resource provider. A resource provider is going to have one or more inventory records (each resource class that the provider provides will have a separate inventory record), and it'll be associated with zero or more traits.

All right, so, a little bit of a deep dive into the communication that's happening from the scheduler to the placement API. Right now, in Ocata, the scheduler winnows down the set of compute nodes that it will then run through its set of filters and weighers by first calling the placement API: hey, find me the resource providers that have capacity for this set of resources. The placement API comes back and says, okay, these 10 resource providers meet your requested resources. The scheduler then runs those resource providers, which are compute node UUIDs, through its own filters for things like the NUMA topology filter, the PCI passthrough filter, all those kinds of fancy things, and then sorts them with its original weighers. It then picks one of those and sends the launch request down to the compute node. So right now, the question the scheduler is asking the placement service is: hey, get me a list of the resource providers that meet a request. That communication is going to change a little bit in Pike as we move toward something called claims.

So, second pop quiz. Are all compute nodes resource providers? Okay. Are all resource providers compute nodes? Mike gets a USB key. Or would you like a... Okay, here we go. Well done, Mike. So the point of this slide is that, remember, the placement engine is generic. Although all compute nodes are indeed resource providers (they provide things like vCPU and memory and that kind of thing), not all resource providers are compute nodes. You can have a shared storage pool. You can have a routed network that is passing out IPs, or an IP allocation pool could be a resource provider. The point about the placement service is that it is generic as to the types of resources it's modeling.

Now, inventories. They're keyed by resource provider and resource class. For each resource provider and resource class, you've got some information about that inventory: the total, the reserved amount, the minimum unit that someone can request of that type of resource on that provider, the max unit, a step size, which we use to make sure things are sane in the request process, and then also the allocation ratio. This is how you provide for overcommit of a particular resource. I should mention before I go on: currently, when the nova-compute worker spins up, it asks the virt driver for its available resources, and the compute worker reports back into the placement service: hey, the hypervisor says I have this much memory, I have this many vCPUs, et cetera. So that's what's currently occurring.

Allocations. Allocations are keyed by resource provider, resource class, and then consumer. We called it consumer instead of instance because, again, we want this to be generic, not specific to Nova, not specific to compute.
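Here is a hedged sketch of the per-resource-class capacity test those inventory fields imply. It's illustrative, not placement's actual code; in Ocata the scheduler asks the equivalent question via GET /resource_providers?resources=VCPU:4,MEMORY_MB:1024.

```python
# Sketch: can this provider's inventory of one resource class satisfy
# a requested amount, given what's already allocated?
def can_satisfy(inv: dict, used: int, requested: int) -> bool:
    # Requests are bounded by min/max unit and must land on step_size.
    if requested < inv["min_unit"] or requested > inv["max_unit"]:
        return False
    if requested % inv["step_size"] != 0:
        return False
    # Overcommit: usable capacity is (total - reserved) * allocation_ratio.
    capacity = (inv["total"] - inv["reserved"]) * inv["allocation_ratio"]
    return used + requested <= capacity

vcpu = {"total": 16, "reserved": 2, "min_unit": 1, "max_unit": 16,
        "step_size": 1, "allocation_ratio": 16.0}
print(can_satisfy(vcpu, used=200, requested=4))  # True: capacity is 224
```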
When you do a nova boot, you'll be able to make a claim on a set of resource providers in one set of allocations. This is important. Let's say you are requesting a flavor that gets four vCPUs, a gig of memory, and 500 gigs of disk, and your compute nodes don't have any local disk that they provide for instances; there's a shared storage pool that provides the disk space. When we're making the claim for resources, we're going to be claiming the disk from the shared storage pool, and we're going to be claiming the vCPU and memory from the compute node that gets selected by the scheduler. We need to do that in one atomic, transactional way. So when you think of a claim, it can be against multiple resource providers, not just one.

Here's just a little diagram of the compute node to placement communication. The things the compute node is telling the placement service are: hey, do I have any update to my inventory for any of my known resources? And also, right now, the compute service is saying: oh, this instance is consuming this amount of resources on me, and it's communicating that information to the placement service.

And finally, the one other model in the resource provider data modeling is the aggregate. An aggregate is simply a collection of resource providers. This is very important (it's the third bullet here): there's no connotation about geography or physical location or anything like that. An aggregate does not mean a rack or a row or a cage in a data center. There's no physical or geographical connotation to it. It's just a group of resource providers. The aggregate doesn't have inventory or allocations associated with it; the resource providers in that aggregate do. So, again, it's merely a grouping mechanism.

All right, so let's talk a little bit about the roadmap for placement: what we've gotten done so far in Ocata and Pike, and what we're moving on towards.

Claims in the scheduler. This work is currently ongoing. We're aiming to have it all done by Pike, or at least the most important parts of it done by Pike. Right now the compute node makes the claim and writes allocations to a couple of databases, which introduces a fairly lengthy interval in which retries can occur; when there's contention for the same last slot of resources on a compute node, that can trigger a long set of retries, which is fairly expensive. Instead of doing the claim for resources on the compute node, we're going to be doing that write of allocations in the conductor, or maybe the scheduler. Scheduler, conductor, right? Okay. That dramatically reduces both the chances of retries happening for various reasons and the time interval in which a retry can occur. So this is ongoing and should be completed by Pike.

Shared resources. This is also a Pike thing. The canonical example here is a shared storage pool, where you've got compute nodes that don't have any local disks that they supply for your instance disk images. Instead, all of that disk space is coming from an NFS share or an RBD pool, right? We need a way of representing to Nova, through the placement API, that, hey, this chunk of resources is shared with this set of providers. The way we do that is via an aggregate association. So that's something we're going to land in Pike, and it will correct the inaccurate reporting of disk in the system.
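Circling back to that claim example (four vCPUs, a gig of RAM, 500 GB from shared storage), here is a hedged sketch of roughly what a multi-provider allocation looks like for one consumer. The exact JSON shape varies by placement API microversion, and the UUIDs are made up.

```python
# One consumer (the instance) claims resources from two providers in
# a single request: vCPU and RAM from the chosen compute node, disk
# from the shared storage pool.
allocation_request = {
    "allocations": [
        {
            "resource_provider": {"uuid": "aaaaaaaa-0000-0000-0000-000000000001"},
            "resources": {"VCPU": 4, "MEMORY_MB": 1024},
        },
        {
            "resource_provider": {"uuid": "aaaaaaaa-0000-0000-0000-000000000002"},
            "resources": {"DISK_GB": 500},
        },
    ],
}
# PUT /allocations/{consumer_uuid} with a body like this either claims
# every piece or fails as a unit, which is the atomicity described above.
```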
So right now, if you're using an NFS share and you have 100 compute nodes all connected to that NFS share, the system is going to report 100 times the capacity of that NFS share. And again, this is because of just inaccuracies that we have in the system.

Tenant and user awareness. This is a big thing. We're trying to get the consumer, which again is generic, associated with a user and a project. This will allow us to replace a number of things like the os-simple-tenant-usage API and allow people to get a very quick and efficient view of resources in the cloud.

Nested resource providers. Anybody interested in SR-IOV and NFV stuff? Yeah, it's like half of the talks here are on that. Anyway, a big thing with the nested resource providers modeling is handling NUMA topology and the SR-IOV physical function and virtual function relationships. We're going to try to do some work on this in Pike, but more than likely it'll be completed in Queens, because it's dependent on a couple of other things. If you come to my talk this afternoon, we'll talk a little bit more about this.

Notifications. The first one, the outbound side, we'd definitely like: whenever an allocation is made, or a new resource provider is created, or there's a change in inventory, we'd like to trigger an outbound notification so other services in the OpenStack ecosystem can consume those notifications and update their state locally. We don't currently have anyone working on this or signed up for it, so if you're interested in contributing to Nova, this is a great place to start; come find us on IRC.

Affinity and anti-affinity cleanup. Or rather, just make it work; that's supposed to say cleanup. The current system we have, called server groups, is really kind of inflexible. We'd like to move to a much more generic modeling of scheduling and placement constraints. What I'm showing here, with nova boot --near and --not-near and all that kind of stuff, is just a strawman proposal of what the user experience might look like. But that's something we'll probably get to in Queens and Rocky.

And then, finally, splitting the service out into its own repository, fully detaching it from Nova, and making it really the generic placement and resource-tracking service for other OpenStack components. That's the big long-term goal: to have all that stuff split out.

Anyway, I've ended a little bit early here because I wanted to have folks ask some questions. So thank you very much.

Hey, Jay. Hi. Can you explain a little bit the relationship between, or even among, aggregates and server groups and even cells? How do they interact, and how do they... Sure. Why do we need all three of those things?

So, aggregates are merely a grouping of resource providers to the placement service. There is the host aggregate in Nova, which has a few other connotations, including decorating that host aggregate with metadata that comes into play. The aggregate in the placement service, again, is generic. It's literally just a grouping mechanism; it's not a way to decorate some set of hosts with metadata. So it's a little different. Cells are something that is not visible to an end user. They're merely a scaling and failure domain for a set of Nova computes, matched to a Nova database and message queue. So it's a chunk of compute. Server groups are the method by which you currently are able to do affinity and anti-affinity placement policies.
You create a server group, specify the policy as affinity or anti-affinity, and then you launch instances specifying that server group as a scheduler hint. The problem is that you can't really update the group or delete members of the group over time, so it can get really janky. So: the server group is visible to the user. Host aggregates in Nova are visible to the admin. The cell is not visible to anyone other than basically the super admin, the deployer; it's not really useful for anything other than internal Nova communications. Does that answer it? Okay.

You mentioned traits earlier and showed that there were traits like the CPU instruction set. Is there going to be work there to automatically assign relevant traits out of the standard set?

Yes. An example of this: Jianghua from Citrix and we have a vGPU spec that's kind of ongoing. One of the things in there is, oh, how are we going to handle it when XenServer boots up and is querying for information about those vGPUs and decorating the resource provider objects with a set of traits? We're going through that discussion of how that all should happen. But yeah, on startup of the nova-compute, we'll decorate with the traits we were able to discover locally, and then the admin will be able to specify any custom traits as well. So we're trying to get it as automated, with as much auto-healing discovery, as possible. Yeah.

All right. So I've got a quick question regarding aggregates and how they work relating to resource providers and such. Just keep in mind here, I'm talking about, like, a Newton architecture; that's what I run right now. Okay. So let's say we have a few computes in an aggregate. Are aggregates actually exclusive to Nova, or do they actually have an influence on what kind of resources they can get from other services, like Cinder, for example?

Very good question. So the vision is that, again, an aggregate is just a generic grouping of resource providers. If you have a resource provider that represents some shared storage pool, you would associate that resource provider UUID with an aggregate, whatever aggregate you want, and then associate that aggregate with the set of compute hosts that that storage pool shares with. Okay, so that's how that works.

In that case, can you currently have a pool that is shared between different aggregates? For example, I have my SAN configured in Cinder and I have, like, two aggregates, but I want both aggregates to connect to block devices in the same resource, in the same Cinder, like in the same SAN.

Yeah, aggregates can overlap. Yes, they can; there's nothing preventing that from happening. All right, thank you. Yeah. That would be a really cool test case to add, actually, so find me afterwards.

My question is to do with the placement API. What kind of algorithms are you going to have to determine the best placement? And is that actually also going to be extensible? In some cases, depending on the workload, I may have all sorts of strange combinations of dependencies. I need some accelerator here, I need this type of storage there. Or I may have power, cooling, whatever consideration. So I think that placement is one of those areas where we could really achieve the kinds of efficiencies that really affect the bottom line of IT's total cost of ownership. So I wonder what we're doing there.
My plan right now is to keep the placement service as simple as possible. When a request is received where we can say, oh, this is going to take some work to figure out, then we can pass it off to a different engine, either in the scheduler (not the placement service, but the scheduler) or somewhere else. The placement service should just be answering: give me the set of resource providers that have capacity for this requested set of resources. It shouldn't be taking notifications from Watcher or something else about the thermal conditions in a cage. Those types of factors can be placed in either a database or some other API that a set of weighers or other filters in the scheduler can call out to. We haven't, at this point, said that we're getting rid of scheduler filters or weighers. We're trying to keep the placement service as simple and consistent as we can. So what the placement service will respond with is which particular resource providers have the available accelerator, vCPU, et cetera; it'll return that set of resource providers. The Nova scheduler will then have the opportunity to pass that set of resource providers on to more custom filters, which can then apply whatever logic, frankly, you'd need. Does that make sense? Okay. Matthew.

So I've got two kind of related questions about aggregates and how they might relate to shared storage pools. Currently in Nova, we have some code which can automatically detect whether or not two computes are using shared storage. Would you anticipate that the compute would automatically be able to create an aggregate and populate it? Or is that going to be entirely an operator action?

I would prefer that to be an operator action and not something that we try to figure out. Right now, the resource tracker (the scheduler reporting client is technically the name of it) does know which aggregates the compute node is associated with, but it's not going to try to go and contact each of those aggregates and figure out if there's, you know, a shared pool, and if so, reconfigure its local inventory or anything like that.

Okay, cool. I mean, I'm ambivalent about that, because the code's kind of janky, but...

That is something that I feel should be done by a cloud admin or a config management system or an inventory management system or something like that. When you change the inventory of something locally, or you say, okay, this compute node no longer has local disk, instead we're going to use this SAN, I think that's something that really belongs in a manual, explicit step that the cloud admin would do.

Cool. So the related question is: how canonical do you see placement as being in identifying this storage as being shared between these two servers? So, for example, during the live or cold migration flow, can I completely ignore that janky code and just ask placement if I'm on shared storage with the target?

That's a good question. I haven't thought about it, but yeah, it makes sense to do that, because the placement service has that information. It knows the relationship between one provider and another. Yes? Mr. Burgers? By the way, time check? What? Two minutes? Three minutes? Okay, cool. Thanks.

You had a slide up there. Man, I am short. You had a slide up there that talked about being able to use the OpenStack client to define a flavor and say required traits. And it was way back there somewhere. And you said that's coming, but then when you talked about what was happening, it wasn't on there.
So: ideas about when you think we're going to get to the point where flavors can list required traits that get passed to placement? There you go. When we're going to get to the point where the required trait would actually work?

No, this is just a strawman proposal of what we'd tack onto the flavor create command to specify that. Basically, right now you've got these extra specs things, right, which are, who knows. And we keep expanding them because... Yeah, so I'm not entirely sure when we'd integrate this kind of thing into the flavor.

Is there a spec yet on it? Or is this just more still...

It's part of the resource provider traits spec, but it's not something that we've got code up for yet. Yeah.

I have three questions. The first one is about placement and migration. Once a VM is created, if I want to do a migrate, will those placement rules still be honored by the scheduler?

When you do a migrate, the conductor is going to call the scheduler to find a new destination host. Okay. So the placement policies will still be... And the scheduler is going to talk to the placement service, which is going to respond back saying, hey.

The second question is about placement checking. Like, before I spin up, can I do a check with your API?

Very good question. So right now the placement service, for, like, GET /resource_providers, is admin-only, because it's essentially service-to-service communication. I would love to get to the point where we can expose some information back to the user, where they can say, basically: hey, if I launch something with this request for some resources and these traits, do you have something there? But right now we return a bunch of, you know, resource provider UUIDs, and that's not necessarily something we want the user to know. The user in a cloud shouldn't know which compute host they're on, right?

Yeah, it would be an admin operation, just to check whether I have resources available for, you know...

Yeah, I mean, that's essentially what the scheduler does right now.

The third question is about placement policies. Right now we have only affinity and anti-affinity in Nova. What if I want, like, an exclusive group? I don't want any other VMs to be created on the host. Is there any option to extend your API to do that kind of policy?

Talk to me afterwards. Sure, thank you. I'd like to keep it as generic as possible, but I've got some ideas around that. Sure, thanks. Yeah?

Something that I'm just a little bit curious about: the host aggregates and the aggregates in the placement API, are they two separate things? Yes. And is one ever going to replace the other, or are they always going to exist as two views?

Yeah, the UUIDs are the same; they're mirrored. But the host aggregate in Nova you can attach some metadata to, which we're not supporting in the placement service. I think eventually, yeah, we'd want to get rid of the host aggregate in Nova, but we haven't quite gotten... There are some things around availability zones that are really weird with the host aggregates in Nova. So, yeah, I think eventually we'd like to... Oh, sorry, were you asking whether you needed to... Do I need to create... You don't need to do anything. Once you create a host aggregate in Nova, it's automatically going to create a record in placement for that aggregate.

What are the different ways the placement decisions are happening today? One is using the hint, and you mentioned PCI devices. Yeah.
Is the goal to bring it all, in the end, to the required-traits thing that you have in mind?

Yeah, I would love it if we could standardize on that kind of thing. We don't want to support scheduler hints long-term, because they're kind of janky, but it would be great if we could have a consistent way of representing the requested things and standardize them as much as possible with os-traits. Yeah.

So, with that note: would I be able to ask for, okay, I want to boot an instance with, say, two vGPUs? Would I be able to specify the quantity?

Yeah. Yeah, so you'd say... Well, you'd select the flavor, which would have two VGPU requested resources. Yeah. Last one.

So, this may be, if nobody else has any questions, it may be a little frivolous or possibly off-topic, but I'll ask it anyway. I keep seeing: we'll only make you move to this once all of the compute nodes can support it. How in the hell are you going to know that?

So, the compute nodes have a service version.

No, I got that, but then how do you... Like, if one goes down, how do you know if it's ever going to come back up? If it was last down at a different version, you don't know if it's ever going to come back up, and now all the rest of the ones that are up are fine, and so you go to move to this thing, and then that one comes back up and it's the wrong version. Like, how do you solve all of the... There's got to be a hundred different permutations that are going to screw you there. Okay. But you've got it all figured out. Awesome.

Well, I don't think we have everything figured out, but Dan does. Yeah, yeah, yeah.

That's why I wanted to make sure nobody else had any real questions before I... That's cool. All right, thanks. Thanks, guys.