All right. My name is Dan Smith; I work at Red Hat. I'm going to talk about scaling Nova with Cells V2.

So if you have a deployment, your deployment probably looks something like this. You've got API nodes, which are the ones on the left there. Hopefully many more computes than that, which are the ones on the right. And then you've got probably some nodes that are dedicated to running things like your database and your message queue. These nodes are easy to identify: they're usually glowing red hot. If you can't identify them that way, the message queue is the one that's on fire. So the goal for us here is to address that problem.

You might also, if you're much larger, have something that looks like this. You've still got API nodes on the left, probably many, many more computes on the right, but you're running what used to be called in Nova just Cells, which we now call Cells V1. In that case, you have sharded out your database and your message queue a little bit, such that one grouping of your compute nodes uses one database and message queue, and another group of them uses a different one. And then you've got this top-level message queue and database, which kind of gets a merged view of the sharded bits. In order to do that, we've got an extra service in Nova, called the nova-cells service, that basically performs, in Python, replication and merging of the things in the smaller sharded databases up to the top-level one. And this works, for some definition of that word. Obviously it's working for some people, but it has a lot of issues.

So I wanted to give a little bit of a graphical, abstract example of what that looks like, and hopefully it will be clear why that's a problem. If you look at some of the big services in Nova, you've got the Compute API, you've got Conductor workers, and then obviously Computes. The red line here is how a boot request comes in from the API and eventually ends up booting something on a Compute node. It comes into Compute API, Compute API does a little bit of early work to return you something as an API consumer, and then it asks Conductor to do some of the longer-running things, like asking the scheduler which compute that should go to, and various other things. And then finally Conductor, once it has selected a Compute, tells the Compute to get busy. Other requests come into the Compute API and go straight down to Compute: for an instance that is on a Compute node, we can look that up and make the call direct.

So in Cells V1, you've got this API-level Cells service, and what I'm going to show is how the red path looks in Cells V1. We hijack the request, quite literally, in Compute API, and we send it over to the Nova Cells service. The Nova Cells service ends up calling down to the Cells service in the child cell, which does some stuff in that database, usually calls back a couple hundred times to Nova Cells at the top layer, which then ends up finishing the request to the API consumer like we would have normally. And then the child cell calls into the Compute API, the same Compute API code that runs up top, but running again inside the cell, which then calls to Conductor, which calls to Compute. So this gray path is really problematic, because it's totally different if you're running in Cells V1.

So we've got some problems, both with the flat deployment and with the current Cells V1. The database and the message queue have to scale with the number of Compute nodes if you're flat. That's a problem.
Many people I've heard here talking about various scale-related things have identified that as a huge problem. It's usually the message queue that gives up first. We also have some reliability issues in that, if you have that single flat deployment, you are dependent on keeping that one database and the one message queue up for anything to work, and so you have no alternative other than HA'ing those nodes. And especially Rabbit, when you put it in that mode, the amount of traffic it's able to handle comes way down when you're replicating everything in a Rabbit cluster.

Another big problem, if you're on Cells V1, is there is no future. Cells V1 is barely tested. It's rotting in-tree. It doesn't support all of the features of Nova, and when I say all of the features, I mean it doesn't support most of the features. You could make the argument that it doesn't support flavors; you kind of have to handle flavors on your own in order to make it work. It's a real problem. So from our perspective, the developer perspective, the different-code-paths thing is just not supportable, and it's why we don't really support Cells V1 right now. It's completely alternate code paths, and almost nobody is running through that alternate set except for the big guys, which we would like to keep online. And then another problem is that we like to have a consistent API, ideally, and there are really scary things that behave totally differently in Cells V1. You expect a lock operation on an instance to have completed when it returns, and it does in the flat deployment, and it doesn't in Cells V1. It's really bad.

So what we need to do is build the scaling approach into the core of Nova, so that everybody is running the same code and Nova itself can scale without bolting this thing on. The approach that we've got with Cells V2 is to still shard the message queue and the database along whatever lines you need to make that reasonable. But instead of replicating everything up and merging it into a top-level database so that the API side of Nova doesn't need to know that that's going on, we just teach the API that Nova can be sharded natively, instead of the bolt-on approach. So what we've got in Cells V2 is teaching the API that instances are on a compute node, a compute node is in a cell, and a cell is associated with a database and a message queue. And so when we go to take action on an instance, or even on a host, we look that up and, just as part of the normal path, we connect and send it to the right location.

One really nice thing about this is that if you're a small deployment, this is just the flat approach that you had before, right? You've got one database, one message queue, one set of compute nodes that use it, and all the APIs point at it. And if the APIs already know how to do the switching and there's only one thing to switch over to, it's exactly the same for all intents and purposes. It makes it really easy to grow: just add another one of those things, put another record in the database that says I now have this other grouping over here, and move on. In Cells V1, you really have to have planned from the beginning that you're going to be large enough to run Cells V1. It's the reason that people don't run Cells V1 if they're only going to be a cell of one: they lose a bunch of features, they have to run a bunch of extra stuff, they have all the problems and none of the benefits. If we build it into the core natively, then growing naturally becomes much easier.
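To make that concrete, here's a minimal sketch of what that bookkeeping amounts to. The names and structures are illustrative stand-ins, not Nova's actual internals: the top-level database stores only the index (cells plus instance mappings), and the API resolves instance, then cell, then connection info, before doing the work against the right shard.

```python
"""A minimal, illustrative sketch of the Cells V2 bookkeeping.
These are simplified stand-ins, not Nova's actual classes."""
from dataclasses import dataclass


@dataclass
class CellMapping:
    uuid: str                 # uniquely identifies the cell
    database_connection: str  # e.g. "mysql://nova@cell1-db/nova"
    transport_url: str        # e.g. "rabbit://nova@cell1-mq/"


@dataclass
class InstanceMapping:
    instance_uuid: str
    cell_uuid: str            # which cell the instance lives in


# The small top-level (API) database stores only the index.
API_DB = {
    'cells': {
        'cell1': CellMapping('cell1',
                             'mysql://nova@cell1-db/nova',
                             'rabbit://nova@cell1-mq/'),
    },
    'instance_mappings': {
        'abc-123': InstanceMapping('abc-123', 'cell1'),
    },
}


def target_cell_for_instance(instance_uuid):
    """Roughly what the API does for an instance-targeted operation:
    resolve instance -> cell -> database/MQ connection info, then do
    the work exactly as a flat deployment would, against that shard."""
    mapping = API_DB['instance_mappings'][instance_uuid]
    cell = API_DB['cells'][mapping.cell_uuid]
    # In real life we'd now connect to cell.database_connection and
    # cell.transport_url; here we just return them.
    return cell.database_connection, cell.transport_url


print(target_cell_for_instance('abc-123'))
```

Note that with a single cell, this degenerates to exactly the flat lookup: one record in the cells table, and every mapping points at it.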
So, some benefits of this; there are always benefits and always caveats. We have a little bit better story around fault domains. The database and message queue that are critical for the computes to continue operating, and for requests to come into those computes, are organized, sharded, and separate. That means that if you lose one of those database and message queue servers, you might lose the ability to talk to a bunch of computes, and they might lose the ability to update some of their accounting information, but the other groups should be unaffected. In order to do all of this, we still have a small database and, at the moment, a message queue up at the top, but it's really only used for recording indexing information, which is: where do the things live? It's really small, it's really replication-friendly, fast, and easy to HA. It's so much less data than in Cells V1, where you have merged your entire cloud into one database at the top.

Performance-wise, you've got fewer computes per message queue and per database, which means the message queue and the database server don't need to be as capable, because there's just lower traffic. The memory usage is lower. And if you need to HA those, you can do that at a much more sane level, because you just don't have the traffic. If you make the message queue HA, the amount of traffic that it can handle is so much lower; but if you split things out so that the traffic is lower, because it's a smaller set of computes, then it's much more reasonable to do that.

And then there are some organizational benefits. You might decide that you've got some very expensive compute nodes that you bill at a much higher rate, which do have an HA database and an HA message queue. You've got big machines to run those things so that you can run them in HA. It's expensive, but people pay for it because they know that the availability of accessing their instances running in that cell is very good. On the other hand, you may have some cheap computes where you don't HA the MQ and the database. Maybe these are your dev nodes, your pre-prod, something like that, or you're just selling them to customers at a lower rate because they cost less to run. That becomes possible if you've sharded this out, whereas with the flat deployment, if you want any of it to be HA, it all has to be HA. It also means that you can bring on a new grouping of compute nodes as a whole cell. You can set up a cell of compute nodes, put a temporary API in front of them, do all your testing and burn-in, and then kill that off and just plug that cell into the main deployment. The cell is independent enough of the rest of the deployment that that's a totally reasonable thing to do. And unlike Cells V1, our grouping structures like aggregates and server groups (sorry, Jay) are global, which means that you can define aggregates that stay within a cell, span cells, or whatever else, so that users don't need to know that there are some arbitrary restrictions due to the cells layout you've got on the back end.

Benefits over Cells V1: a single scheduler knows about all the nodes. Some people don't think this is a benefit, because they currently have scaling issues with the scheduler right now, and that's obviously totally reasonable.
We're hoping with placement to make scheduling decisions fast and atomic, to the point where running one scheduler for your whole deployment is a benefit, because the scheduler has visibility into affinity or anti-affinity restrictions across the entire deployment. Right now with Cells V1, you schedule at the top to pick a cell, and then you schedule again in the cell to pick a host. In Cells V1, you cannot have cross-cell migrations: once you've ended up in a cell, you're there. In Cells V2, it is totally feasible for us to add the ability to migrate an instance from one cell to another, and that may be for upgrade reasons, or because you want to move something into the HA cell or out of the HA cell.

Performance-wise, we have no replication in Python, so we have a lot less message traffic flowing up and down to synchronize data from this database to that database. We don't have duplicated data, where your top-level, single API-cell database is a merged set of everything that has ever been in any of the children. And since we're not duplicating and syncing these things, there's nothing to get out of sync, whereas with Cells V1, almost everything can get out of sync. That was a really important tenet here: to try to get away from these cron jobs of operator-contributed scripts that are constantly cleaning up the mess that Nova is making behind itself. A huge benefit for us as developers, as well as for the operators, is the fact that this is mainline development. Running Cells V1 right now is a huge liability, in that there are very few actual deployments in the world running that code. Unifying this path so that everybody is running the same set of code, regardless of whether they've got one cell or 20 cells, means that everything is more reliable. There's much less code to maintain and much less code to get broken. And that whole "all deployments are a cell of one" idea is a primary driving factor behind all of this. So in that vein, I feel like there are business people here that dictate at least one graph in each slide deck, so here's mine. Originally, we had very few people running cells. Then we had more people running cells. And after Ocata and Pike and Queens, everyone will be running cells regardless; they just might have one. So there's your camera, your shot, your picture.

All right, so this is how the services are arranged in Cells V2. At the top level, you've got your REST API nodes. You've got a small, easy-to-HA, mostly read-heavy database, and a very small MQ, which really just serves one very minor purpose. You've got the scheduler or schedulers. And you've got a conductor, which also serves very few purposes; since it's at the top, and because we could, we now refer to this as the superconductor. In each of the cells down below, you've got your regular database, where all your instances are stored, all of the metadata for each of those things, all that kind of stuff, as well as the conventional MQ that you use for compute talking to conductor and compute talking to compute during a migration, all that kind of thing. And that is replicated, of course, in each of your cells. I put placement on here because it's a hot topic, and people are very aware that this is going on. Right now, placement does live at the top level, but it is designed and incubated in Nova to be kicked out of Nova, so I put it outside that box, while still showing that it's at the top level.
And the scheduler uses placement, as we'll see in a minute, to help pick stuff in the cells in a very efficient way. So for a boot-type operation, you've got the API taking the request from the user, and it creates, in the top-level database, really just a record of the fact that "I've told the API consumer that I'm going to create an instance, and I told them that the UUID is this." And it asks the superconductor to do the long-running process that a boot requires, which involves talking to the scheduler, talking to placement, and various other things. The superconductor picks a compute node, not a cell. It picks a compute node, which dictates which cell it's in, and it calls directly down to the compute to start the build, just like it did before. It's literally the same conductor code, just aware of all of the nodes, calling to compute as it did before. Unlike in Cells V1, the API can also still call directly from the top level all the way to the compute for things that would go direct normally: things that take action on an instance that exists on a host, which exists in a cell.

So we've got to change things in Nova, obviously, to make this work. There are certain data structures that, since everything was all in one database to begin with in the flat deployment, are necessarily global, and therefore they get moved out of what used to be the main database, and is now the cell database, up to the top-level database. These are things like flavors, aggregates, and key pairs. And again, like I said, these don't change very often; hopefully you're not creating flavors at approximately the same rate that you're creating instances. Users can create key pairs, but it's not super heavy, and even then, a key pair is a very small string. So we have to move those things to the top so that the API can access them, and then we bundle the things that need to go with an instance, with that instance, when we make the call down. It used to be that we referred to the flavor an instance was booted with by ID, which means they have to be in the same database, but that also had other problems: you could change the flavor later, and then the instance would claim to be booted from a flavor that does not actually represent what it is. So keeping that information with the instance has a general benefit anyway, and by doing it, we can move the instance down into the cell database and not have to have this linkage between the flavor that lives at the top and the instance that lives at the bottom. That really was the reason that flavor syncing had to happen in Cells V1: you had to have the flavor in the top-level database, because you had all of that mirrored and linked, and you had to have the flavor in the bottom database, because it was also down there and linked, and anything you did had to load the linked flavor. So if you've ever run a Cells V1 deployment, you know that you've got to keep those things in sync or bad things happen.

In the top-level database, in addition to the index information of where all the instances live, we also have a list of the cells. And a cell is really just defined as a UUID, to uniquely identify that cell, plus the connection information for its database and its message queue. That's all the data structure we need to identify something as a cell, as a grouping of computes. And a cell really is just a whole Nova deployment of just those services, all talking to that database.
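Here's a rough sketch of that boot path, again with illustrative names rather than Nova's actual internals, and with the scheduler/placement interaction reduced to a single stand-in call:

```python
"""Rough, illustrative sketch of the Cells V2 boot path described
above; names and structures are stand-ins, not Nova's internals."""

# Top-level index: which cell each compute host belongs to, plus each
# cell's connection info, as stored in the API database.
HOST_MAPPINGS = {'compute7': 'cell1', 'compute9': 'cell2'}
CELLS = {
    'cell1': {'db': 'mysql://nova@cell1-db/nova',
              'mq': 'rabbit://nova@cell1-mq/'},
    'cell2': {'db': 'mysql://nova@cell2-db/nova',
              'mq': 'rabbit://nova@cell2-mq/'},
}


def schedule(request_spec):
    """Stand-in for the scheduler consulting placement and picking a
    compute node -- a node, not a cell."""
    return 'compute7'


def superconductor_boot(instance_uuid, request_spec):
    # 1. Pick a compute node; that choice dictates the cell.
    host = schedule(request_spec)
    cell = CELLS[HOST_MAPPINGS[host]]
    # 2. Create the instance in that cell's database and record, up in
    #    the top-level index, which cell it landed in.
    print(f'instance {instance_uuid} created in {cell["db"]}')
    # 3. Cast directly down to the compute over that cell's MQ -- the
    #    same conductor-to-compute call a flat deployment makes.
    print(f'build_and_run_instance -> {host} via {cell["mq"]}')


superconductor_boot('abc-123', {'flavor': 'm1.small'})
```

The point being: nothing here schedules "to a cell" the way Cells V1 did. The cell falls out of the host choice, and the final hop down to compute is the same call a flat deployment makes.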
And the API nodes use that connection information to go connect to that MQ and talk to a compute node, or to write something into that database. As a result of this, we have to have a little bit of a discovery mechanism to build that index: not constantly, but any time you add a new compute node at the bottom layer, we build an index entry at the top that just tells us what cell that node is in. We've got to teach all the code about the shards; that's the API bit knowing to multiplex. And then we also have to performance-optimize the code that deals with multiple cells. Before, when we would do an instance list, we would do a select out of the one database, because everything was in one database: for however many they asked for, we would load that out of the database into memory in nova-api and move along our merry way. But now your instances could live in all these different cells, so it's a little bit more complicated to do things like a "list all instances" operation, because you've got to handle the fact that they live in different places.

So this is a sequence diagram of what that looks like, and I'm showing you this because it's important to note that not all of these things will be fully performance-optimized right away, and they may be more linear than we want; however, this is what it should eventually look like. If you come to list your instances, you look up the mappings for all the instances that you own. That tells you which cells you have instances in. Now, if you have aggregates and restrictions set up such that, for whatever reason (I think CERN keeps all instances for a particular tenant in one cell, maybe, or something like that), it could be that you only have instances in one cell. But if you don't have such a restriction, the department-foo tenant might have instances in two of the 15 or 16 cells in the deployment. So we get that list of cells, and we can parallelize smaller queries to each of those databases, so that we're doing them in parallel. We're doing smaller queries, which require less memory on all of the nodes, and then we interleave them in such a way that you get a unified response that is hopefully sorted and paginated the way you wanted, and maybe even quicker than it was before, with a smaller memory footprint required. There's a rough sketch of that below. There are other requests that will require us to do this, but list instances is the big one: it's common, and you have the ability to generate a lot of data and database traffic by doing it.

So those are most of the good things. There are some caveats. Most of these are issues that we've identified that we will have to work through and eventually come out of; other things are going to slightly change how things behave, permanently. I said not all of those queries are going to be optimized right away; hopefully listing instances will be reasonable in Pike, which would be really great. There will be other things, like some admin operations where, if you need to list all of your services or take action on a service, we do a little bit of a linear scan of things, but those are less frequent operations, hopefully not a huge deal, and there's plenty of room for optimizing them. There's a little bit of added complexity for the small deployments, but it's really not a whole lot.
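Here's the promised sketch of that scatter-gather listing; as everywhere in this write-up, the names and the in-memory "databases" are illustrative stand-ins, not Nova's actual code:

```python
"""Illustrative sketch of a cross-cell instance list: fan smaller
queries out to each relevant cell in parallel, then interleave the
sorted partial results into one sorted, paginated response."""
import heapq
from concurrent.futures import ThreadPoolExecutor

# Pretend per-cell databases, each already sorted the way the user
# asked (the real per-cell queries would be small, sorted SELECTs).
CELL_DBS = {
    'cell1': [{'uuid': 'a1', 'name': 'vm-01'},
              {'uuid': 'c3', 'name': 'vm-03'}],
    'cell2': [{'uuid': 'b2', 'name': 'vm-02'}],
}


def query_cell(cell, project_id, limit):
    # Stand-in for one small, sorted, limited query against one cell.
    return CELL_DBS[cell][:limit]


def list_instances(project_id, cells, limit=100):
    # The instance mappings already told us which cells to bother
    # asking; query those in parallel.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda c: query_cell(c, project_id, limit),
                           cells)
    # Merge the already-sorted partial lists without pulling the
    # whole cloud into memory at once, then apply the page size.
    merged = heapq.merge(*results, key=lambda inst: inst['name'])
    return list(merged)[:limit]


print(list_instances('dept-foo', ['cell1', 'cell2'], limit=3))
```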
On that added complexity: you don't actually have to run any more services, like you would if you were running a Cells V1 deployment of one cell, but some of that index information needs to be created when you roll out your deployment, and when you add compute nodes there's a little bit of extra stuff that has to happen. But we've already got multiple tools to make that a little bit easier, as well as some automated ways to build it, if you're at a small enough scale where it's worthwhile to have that always running in the background for the convenience factor.

The gotchas are few, but will require a little bit more work to fully resolve. One thing that we're really trying to avoid is the cell knowing anything about where it fits in that topology. That makes it very convenient from an organizational point of view, where you could detach a cell from a temporary API deployment and attach it to your production as you bring things online, but there are also security and delegation-of-access-to-global-data reasons. A lot of people that have run Cells V1 in the past have firewalled off the cells from the top layer. Everything flows down, and a little bit flows up through that message queue, because it has to with Cells V1, but what we really want is to keep everything moving down and not require this up-call. Unfortunately, scheduler retries, which are too common in large deployments, would technically require an up-call. However, Jay is going to fix that, so we won't have those anymore, especially for resource-related reasons, which is 95% of why everybody has reschedules: the scheduler made a bad decision eight times quickly, a whole bunch of things rushed at the same compute node, and you use reschedules to basically paper over that fact, which is a really expensive way to keep things running. So hopefully with placement, we will no longer have nearly the number of scheduler retries, such that we can handle the few that we have, for unexpected failures, with a more constrained model that fits within our topology.

Another one is the late-binding affinity check that we do. I'm quite sure there's a note from Russell B, a "this is a total hack" comment, above that line. It was added in Icehouse, I think, when we added server groups. Basically, the scheduler makes bad decisions constantly, so even if you've asked the scheduler for affinity or anti-affinity with this other instance, once you finally reschedule your way to a compute node, you may be violating that policy. So compute calls back up to the top level and basically asks for a sanity check once it's landed on a compute node. That also can be solved by placement, before we've started burning any resources, and therefore that one will go away as well.

There's another gotcha which is harder to resolve, and that is: if you lose a database and a message queue from one of your shards, everything else continues to work, but the API consumer expects to still be able to list its instances, and we won't be able to fully do that because that database has gone down. The API nodes can handle that fact, but they need to show something to the API consumer so that the consumer doesn't say, "huh, I lost 50 instances, I guess I'll go recreate those," and then have them pop back up once the server comes back.
So we've got a little bit of index information at the top; we can at least show the user that you had an instance, here's the UUID, and when it comes back it will probably have attributes that we don't really know about right now. We can totally make that much better, obviously, with some kind of standard caching at the top layer, which is good; everybody loves caching, right? But at least initially that's going to be a little bit of a gotcha. Most of the people with Cells V1 deployments have told me that if a whole cell goes offline, they probably have bigger issues, but it is a thing.

Okay, so this is the service arrangement again; I just wanted to revisit this. This is where placement kicks in. At the top layer, the scheduler calls to placement, which ends up having all of this topology information; well, not topology information, but all the information about everything below it in the topology, so that you can pick compute nodes very well. And this is what an up-call looks like. This is what we're trying to avoid: something in the lower layer needing to call back up to something at the top layer. It's for security reasons, but also we don't want the lower layers to have to have credentials and connection information for the things at the top, and we don't want them to know where they fit into the arrangement. And then I wanted to point out that this is a multi-cell service arrangement, which looks much more complicated than a flat small deployment, but really these are the only pieces of all of that that are required if you've only got one cell. You don't need a superconductor. You don't need an extra message queue up at the top. You need APIs regardless. You need a scheduler regardless. You do have that small, read-heavy database at the top, but it can just be another database on the same machine your main database is on, so this is really not much different than the way a flat, regular Nova deployment looks right now. So it's really not that much more complicated for the small deployments, and you can go from this to multi-cell much more easily.

So, there are migration hurdles. People have referred to this as a feature of Nova, like server groups are a feature or SR-IOV is a feature, and it kind of is, but this is really a total re-architecting of a lot of internal plumbing in Nova, and as such there will be bumps in the road. One thing that we did in Ocata was to introduce a new command in Nova that is kind of like nova-manage but is intended to be even more standalone, called nova-status. This is intended to be a health check or a pre-flight upgrade check kind of thing: you run the new version of the nova-status command against your existing deployment before you change anything, and it should give you red things, yellow things, and green things. Red: you have not done your homework and I have proof; go do it, because you will break if you move. Yellow: well, I can't find all of these things that I think you probably should have, so you might want to check on this. And green: congratulations, this homework item has been completed; you should be good when you roll forward. We're trying to do that to make it easier to step through some of the changes that are required to move everybody to this unified core. So, if you have rolled to Newton, then you already have the instance-mapping index part of that whole process built.
If you have rolled to Ocata, congratulations, you have host mappings and are using those host mappings. So that's most of everything you need built, such that when you get Pike and the rest of the code is aware of all of these relationships, you could actually potentially split out a cell. I will point out, because multiple people have hit it, that new hosts getting added do require this discovery process to be run, to add a new index entry mapping that host to the top level, whereas before, Nova compute nodes were very much self-registering: you just start one up and it plugs in. It might start up, plug in, and get instances well before you are ready for it, but it will attempt to. So you can run that discovery step manually at the end of your deploy step, which is a really easy way to do it: if you bring on a new compute node, you just run the discovery step and it will pop one in there. And there are a couple of other ways you can run it automatically, such that it's just constantly checking for new hosts, and you have a little bit longer delay than the current couple of minutes or whatever before that mapping gets made.

Make sure you're up to date on your online data migrations, because this whole process was moving things from here to there so that we could drop compatibility and expect separation. Now that Chet's gone, I can say it sounds like Chet has not necessarily been doing all his homework every time. nova-status will tell him, so he should be good, but making sure that you're doing that each time means that you've got less work to do when it really counts. And upgrades are going to be a teensy bit messy initially: if you want to do a fully live upgrade, you're going to have to do a little bit in each cell, and then a little bit in each cell, and a little bit in each cell, which is exactly the way live upgrades work right now with a flat deployment, but that activity will be striped across the cells. Eventually, what we'd really like to get to is: upgrade a whole cell, upgrade a whole cell, upgrade a whole cell, and then roll your API to enable all the new features that everything underneath it can now take. And especially if we have cross-cell migration, that will make it much nicer in general to upgrade chunks of your infrastructure.

So Belmiro and a couple of other people are thinking: what if I'm on Cells V1 right now? All of this migration-path stuff sounded like it was related to the flat deployment, and I'm sorry to say you guys are in big trouble. But not really. So this is what Cells V1 looks like again, right? We've got this replication process, and we've got the sharded database and message queue. This is actually very similar to the way Cells V2 looks, except that in Cells V2 you don't have this top-level, completely hyper-converged database with everything in it. All your data is already sharded. So what we can do is take away that replication process, take away the converged, merged, top-level unified database, and then install the records that the API needs to access the already-sharded data. You're basically just throwing things away from your Cells V1 deployment and teaching the API nodes where to find the things that were already separated. And then the final note: yes, there's still an API database and a small message queue in this, obviously, but it's not the super-converged, big merged thing.
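To illustrate the shape of that (and only the shape: as noted below, this path isn't baked yet), here's a sketch of what "installing the records the API needs" amounts to. The nova-manage cell_v2 commands named in the comments (create_cell, map_instances, discover_hosts) exist to the best of my knowledge, but everything else here is illustrative:

```python
"""Illustrative sketch of the Cells V1 -> V2 conversion idea: the
data is already sharded, so you mostly register what already exists
and throw away the replication machinery."""

# Your existing V1 child cells: each is already a database + MQ pair.
# The 'instances' lists stand in for what's already in each cell DB.
V1_CELLS = [
    {'name': 'cell1', 'db': 'mysql://nova@cell1-db/nova',
     'mq': 'rabbit://nova@cell1-mq/', 'instances': ['abc-123']},
    {'name': 'cell2', 'db': 'mysql://nova@cell2-db/nova',
     'mq': 'rabbit://nova@cell2-mq/', 'instances': ['def-456']},
]

# A fresh, small top-level index -- NOT the old merged V1 database.
api_db = {'cells': {}, 'instance_mappings': {}}

for cell in V1_CELLS:
    # 1. Register the already-existing shard with the API database
    #    (conceptually what `nova-manage cell_v2 create_cell` does).
    api_db['cells'][cell['name']] = {'db': cell['db'],
                                     'mq': cell['mq']}
    # 2. Walk the cell's existing instances (and hosts) to build the
    #    index (conceptually `map_instances` and `discover_hosts`).
    for uuid in cell['instances']:
        api_db['instance_mappings'][uuid] = cell['name']

# 3. Shut down and remove the nova-cells replication services and the
#    old merged top-level database: the API now reaches the already
#    sharded data directly through the index built above.
print(api_db)
```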
So that's the plan. It will not be baked in Pike, but we're really hoping that people that are already running Cells V1 right now will help test and prove out Cells V2, in pre-prod, that kind of a thing, to help us make sure that this process is bulletproof and doesn't take down GoDaddy, because all our websites run there. So that is all I had: 40 minutes and seven seconds. I think we're the last session, maybe, so we can still take questions if anyone has them.

I have a quick question. So this is not gonna be visible to the users, right? Like, any part of it? Correct. So it's not gonna show up in, like, the host aggregate hierarchy; they will never know about it; it's only known to the scheduler? Yes, and we have a definitive mandate to ourselves that we will not leak this out of the API.

The goal in Pike is to have the ability to split out another cell and have things work. And we've got... did I not have a goals slide somewhere? But yes, we've got multi-cell testing in the gate that's almost ready to merge, and that's when we will consider it reasonable for people to start proving it out; we expect that to land in Pike. You can actually do it now with what's in master.

The conductor in the cells handles things like coordinating migrations between compute nodes, like it does today. It handles isolating the compute nodes from the database schema, like it does today. The boot process is handled by the superconductor now instead. Yeah, it's the same code and it's the same process; it's just running at a layer where it has visibility into the full set. Yeah, so rebuild and boot and rescue... no, not rescue. There are a few operations that get handled by the superconductor instead of the cell conductor if they're split, but there's no duplication and no calling from conductor to conductor.

First, I definitely have the impression that Cells V2 as a design is cleaner and simpler than V1. So my question is: do you have any metrics or numbers on scalability, like the maximum number of virtual machines or hosts Cells V2 can manage? I think that is probably gonna be very similar to the magic number for Cells V1, because you've still got about the same level of traffic in the sharded databases. I just watched a presentation from someone from Mirantis that was doing a single unified, kind of theoretical thing, and, I mean, there are flat deployments today that do 1,000 compute nodes, right? So maybe theoretically you could hit that, but I think most Cells V1 people do like 300 or less in a cell today. Is that right? Yeah. So it's about 300 nodes? In one cell? Sorry: I think it will be similar to what it is today for Cells V1, which in practice is usually about 300 per cell. Okay, thank you. Yeah.

Let me get you a Neutron person; hang on. So when you schedule, when you boot, you have to select the network. Correct. Is the assumption that that network lives in all the cells, or can you, like... how is this gonna work? Yeah. So, once again, placement fixes everything. Eventually Neutron, like Cinder, will be reporting groups of contiguous network, or groups of contiguous shared storage. What? Yeah, they call it network segments, right?
And Nova, in placement, will be able to attach compute nodes, or an aggregate of compute nodes, to a particular network segment and to a particular piece of Cinder's shared storage, so that if you boot and require a particular network, that will naturally filter out all of the compute nodes that aren't attached to it, and/or make sure that if you do pick a compute node first, you then pick a network that goes with it, right? So how you separate your networks, whether it's per cell like a lot of people do, or more than one per cell, or global for everything, is, I think, completely up to you.

And is the goal that all of that's gonna land in Pike, or is it already there in master? Like, all the stuff you just mentioned with placement and Neutron. I think, yeah, I think we're maybe hoping to have the shared-storage bit in Pike, and the Neutron bit is very similar, but I just think it's, yeah, I think it's Queens for Neutron, but yeah.

I'm assuming for services you didn't mention, like Glance, you're still gonna have to handle replicating your images manually across the cells? Well, these are not regions, so you would still have Glance however you have it today. You could potentially do your own replication if you need to, such that you have multiple Glances but they're all the same, or something, I guess, maybe. But yeah, the intent for Cinder and for Neutron and for Glance is that they might scale differently, and you just have one, right, yeah. So you have a cell or multi-cell arrangement within a region, right? And so if you had multiple regions, you would have multiple service catalogs and therefore maybe multiple Glances, but this is all within a region, yep. Cool, all right, sweet.