All right. My name is Dan Smith. I'm an engineer at Red Hat. Since I started with OpenStack about two years ago, I've been tasked with making the live upgrade goals of Nova come to fruition. I focus mostly on Nova; eventually someone's going to make me look at the upgrade issues of other projects, but Nova has plenty of issues to keep me busy for a little while. So that's what I've been working on, and today I'm going to talk about the progress that we've made, where we've come from, and hopefully where we're going to end up before I go crazy. So I think we have people in the room that are operators that run clouds. You all have a switch outside your data center. It looks like this, right? Yes. Yep. Sometimes it's a big red button, or it's got a plastic cover on it to make sure you don't hit it accidentally. If you're not an operator or you're not sure what switch I'm talking about, that label helps: it's the run-or-upgrade switch. So right now in Nova, well, not right now, but historically, upgrades have been very simple in that you turn Nova off for the most part, do your upgrade, and turn it back on. We call those simple upgrades. That's the feature. Obviously that's not really the ideal way to do upgrades of your code, because the cloud should never be turned off. But until now, all of the components of Nova had to be on the same code level all the time, so this was pretty much the upgrade switch. In general, across OpenStack, you at least can break the problem up into services. Because we have strong API guarantees on the external APIs of the various services, you can generally move one service forward to the latest release, make sure everything's cool, then roll forward the next one. So generally we recommend that people do Keystone, make sure that's upgraded, then do something like Cinder, and then Glance. And then when you're totally out of other things to do, you do Nova and roll that forward. 
So the APIs help us because Nova consumes Glance's stable external API, and so you can move Glance from one version to the next without any sort of issues; they're only tied together on the external API there. But when you get to a particular service, you generally do the whole service all at once. For a smaller service, that's not quite as big of a deal. For Nova it's a huge deal, because you've got Nova everywhere. You've got piles of it in every corner. And so turning everything off is just a real horrible thing to have to do. It prevents anyone from doing much of anything. So that's what we want to try to improve, at least from a Nova perspective. The point I will make here, though, is I say "turn the cloud off," and if you've not done it or you didn't realize, generally when you upgrade Nova, you can avoid turning the instances off. You may turn Nova off in the sense that the user can't create new instances, list their instances, and do all that kind of stuff, but the VMs are actually up. And if you don't break your networking while you're doing your upgrade, then they mostly don't notice, except for the fact that the users can't do anything. So just to make it clear, we at least are able to do that much live upgrade, where Nova can be switched out underneath and the instances can stay running. Since this talk is the march toward live upgrade, I have to give a little bit of history. In the Folsom timeframe and before, your deployment looked something like this. You had a bunch of API nodes. You had, hopefully, a ton of compute nodes. You had a queue and the database. And for the most part, everything treated the database as the central source of truth, and pretty much everything went to the database for almost every operation. 
So you would have a request come to the API from a user, and the API would put something in the database, manipulate some things, pull some stuff back out, and then it would construct a message to send to some compute node that was actually going to do some work. So if you were booting an instance, the API node would put some stuff into the database, pull out a record ID, and send it into the queue to bounce around. When it comes out of the queue and goes to the actual compute node that's going to handle it, the compute node would generally have just an identifier of some resource. It would have to go back to the database, pull out what the API node just put in there, act upon it, and probably write some status updates back into the database. It was all very, very database-centric. So this is not necessarily a problem, but it does mean that all of the nodes that talk directly to the database have to be intimately familiar with the schema of the database. So anything that we were going to do to the code to make it tolerant of moving the database forward at runtime would have to be code that is running everywhere. And so we wanted to try to get away from this. We wanted to get away from this technique for other reasons as well, so it fit pretty well. In the Grizzly timeframe, we made a little bit of progress. We had an effort that we called no-db-messaging, and then no-db-compute, where instead of putting stuff into the database, sending just an identifier over the queue, and having the node on the other end pull it back out, we ended up sending the whole thing that was the subject of the action over the queue to the receiving machine, so that it didn't have to go back to the database just to pull out the thing that we were talking about. 
So if you're booting an instance or taking action on the instance, the API node would pull that out of the database, write some task state to the database, and then send the whole instance over the queue so that, ideally, when it got to the compute node, it had the whole object and didn't have to go fetch it again. That meant we massively decoupled the compute nodes from the database, and toward the end of this whole process we actually got to the point where the compute nodes were not allowed to talk to the database. That was kind of our enforcement mechanism: if anybody added any code that would cause the compute node to directly access the database, and therefore be dependent on the schema, it would blow up, because we had broken that path so that it couldn't actually work. The problem with this, of course, is that when we pass the instance to the compute node, the compute node's got to eventually write some state back into the database about the fact that it finished whatever you asked it to do; it's got to make those changes. And then, of course, every 10 minutes or every 30 minutes you've got periodic tasks that run that have to take action in the database. So we introduced a new service into Nova that was very simple, scaled horizontally very easily, and was relatively easy to predict. If you're running out of processing capacity, you just start more of them, and this service served to give those compute nodes a path back to the database when they need to actually make a change. We call that the conductor node, and it's in a little flashy bubble because it was a highly contentious name, and it doesn't really conduct many things, but anyway, that's what it's called. So if you've got a deployment and it runs nova-conductor and you're wondering what it's conducting, that's its primary task. 
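The conductor split described above can be sketched in a few lines of Python. This is a toy model with invented names (the real nova-conductor interface is an RPC API, not these classes); the point is just that the compute node holds no database handle at all and routes every write through the conductor.

```python
class Database:
    """Stands in for the real schema-aware database layer."""
    def __init__(self):
        self.instances = {}

    def instance_update(self, uuid, **values):
        self.instances.setdefault(uuid, {}).update(values)
        return self.instances[uuid]


class Conductor:
    """The only compute-facing path back to the database."""
    def __init__(self, db):
        self._db = db

    def instance_update(self, uuid, **values):
        # In the real system this call arrives over the message queue.
        return self._db.instance_update(uuid, **values)


class Compute:
    """A compute node: it is given no database handle, only a conductor."""
    def __init__(self, conductor):
        self._conductor = conductor

    def finish_boot(self, uuid):
        # Write final state back through the conductor, never the DB.
        return self._conductor.instance_update(uuid, vm_state='active')


db = Database()
compute = Compute(Conductor(db))
result = compute.finish_boot('uuid-1')
```

Because the compute class simply has no way to reach the database, any patch that tried to add a direct DB call there would fail, which mirrors the enforcement mechanism described above.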
So that provides us the ability to mostly pass the big subject of an operation ahead of time to the compute node, and then let the compute node use another service to write changes back after it's done. And that keeps all of our schema-sensitive code away from the compute nodes, of which you have the most, right? So the goal here is to try to compartmentalize the bits that have to actually be really, really concerned about the schema. So we made that progress in Grizzly, and at the Havana Summit we were hoping to make Havana our first upgradeable release. When I say upgradeable, I mean live-upgradeable, which means we were going to hopefully land a bunch of things in Havana that would allow us to live-upgrade from Havana to whatever the "I" release would be at the time. So we'd made this progress in Grizzly. We had decoupled all the compute nodes from the database. The problem is we were basically just remoting the same DB API calls that we had before through this RPC mechanism. So while it wasn't using an actual MySQL connection to talk to the database, it was still doing basically the same queries and the same updates. So it was still fairly tied to the actual state of the database, and then, of course, when the API node would pull something out of the database and send it over the queue to the compute node, it was just whatever the database barfed out, and so it was very schema-sensitive to that as well. But we had made this progress where we had kind of drawn a box around everything that a compute node needed to do, so we had an idea of what we had to encapsulate. The other problem with this is that in order to pull something out of the database, we would normally get a SQLAlchemy object, which, despite how hard you push, won't go through an RPC channel. So we had to basically serialize everything to JSON, and then once you send it over the RPC bus to the compute node, it's never anything more than a dictionary. 
So we lost a lot of the rich object functionality that we had before, like type checking and that kind of thing. But we had gained this conductor service, which provided us a little bit of insulation, which means we could start to adapt our approach and not have the nodes running old code talking directly to the database. So when we rolled forward to the next release, we could have code in the conductor that still serviced that old API, or that old schema, and that gave us a chunk of code in between the compute node and the database that was going to be switched out, so that we could actually effect this change. The other problem is that, as I just said, we now had all the compute nodes using this conductor interface, which was great from the isolation point of view, but all the other nodes that could talk directly to the database were using the database API. So we had a duplicate of all of those methods in each one of those API definitions, which meant that they weren't always exactly the same. They would drift a little bit: we'd add a feature to the database API, but we wouldn't need it for a compute node, so we wouldn't add it to the conductor API, and it got a little confusing. And then, of course, we do occasionally reuse code, so we would have some code that needed to put something in the database, either over RPC or direct to the database, and if it was used by two different components that had different policies about whether they could talk to the database, we ended up passing API objects around, and it got really, really ugly. But it was kind of our first step to breaking those things apart. So in Havana we made some more progress toward addressing some of those issues. Specifically, we wanted to simplify how all of that worked. We didn't want to have a database API that some code used and a conductor API that other code used, where people had to know which API they were allowed to use depending on where their code was going to run and in which service, that kind of thing. 
So we started this effort that we called Nova objects, and we usually shorten it to object or objects, which is very conveniently confusing, because everything is an object, and so when we talk about the impact to "the object" or whatever, we're always talking about Nova objects, of course, but it gets a little confusing. So formally we call this Nova objects, and it really helps unify, in a kind of object-oriented way, how we interact with the database and how we pass those objects from the API node all the way down to the compute node. The nice thing is that it's used everywhere, and it lowers the bar: if you need to add a new method that operates on some data, or you need to add a new field to an instance, something like that, the architecture helps make sure that you get that correct. So instead of just adding a new raw RPC call that you're passing some blob of data to, it helps bind all of the upgrade-sensitive things together: the RPC call that you're going to make, the new piece of data that you're going to add, the version tracking of all of those things. It means that we're able to add new things much more easily and get it right the first time, instead of finding out that it's broken for one upgrade in a release and having to iterate there. So it helps us to get it right the first time. It bundles a set of methods and a set of data with a version into a single package, so that we have a concept of what the instance object looked like in Havana versus Icehouse, and what it means to take an instance object that we need to pass to an older node and backport it to that old version. So if you've got an Icehouse node that pulls an instance out of the database, we know that the Havana nodes need version 1.2 of that instance, and we have a defined routine that can backport it so that we can send it over. 
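Here's a minimal sketch of what such a versioned object might look like, loosely in the spirit of what later became oslo.versionedobjects. All class, field, and method names here are illustrative rather than the real Nova code; the idea is just that the data, its version, and the backport routine live together in one package.

```python
class Instance:
    VERSION = '1.2'   # bumped whenever fields or semantics change

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.version = self.VERSION

    def obj_make_compatible(self, target_version):
        """Backport this object in place so an older node can read it."""
        if target_version == '1.1':
            # Pretend 'ephemeral_key_uuid' was added in 1.2: strip it
            # so a 1.1-era node never sees a field it doesn't know.
            self.fields.pop('ephemeral_key_uuid', None)
            self.version = '1.1'

    def to_primitive(self):
        """Serialize for RPC; the version travels with the data."""
        return {'nova_object.version': self.version,
                'nova_object.data': dict(self.fields)}


inst = Instance(uuid='abc', host='node1', ephemeral_key_uuid='key1')
inst.obj_make_compatible('1.1')   # the defined backport routine
wire = inst.to_primitive()
```

Because the version number rides along on the wire, the receiving side can tell exactly what it was sent, instead of guessing at whatever schema the database happened to produce.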
Whereas before, we were just taking whatever was barfed out of the database and shooting it across, and so if there were two different versions of the code, one of them was expecting an old schema. We didn't really know what they were expecting or how to backport it, because we hadn't codified all that. So this really helped us define the format of the things that we were sending over RPC, whereas before it was just whatever the database handed us. And then, additionally, kind of incidentally, the object automatically serializes and deserializes itself, so you can pass it over RPC efficiently without this generic "turn the whole thing to JSON and it's JSON forever" approach. When it ends up on the other node that you sent it to across RPC, it rehydrates itself into this rich object, and we get better type checking and bounds checking. We have a whole bunch of tests, for example, that pass a dictionary to a piece of code that looks kind of like an instance, where there's an ID field which is actually an integer, but we've got it as a string because the code doesn't really care, and then you run that on Postgres and it totally blows up, because Postgres cares. So it really helps us to better define the data that the code is working on. So we did a lot of object work in Havana. We're far from done, but we put in a bunch of that stuff that was going to provide us the launching point to be able to do our live upgrade to Icehouse, so that we had defined very closely what messages were going across the bus and what we needed to do to backport, and that sort of thing. So for our first live-upgradeable release, instead of just going full-on "you can shoot any node in the head and upgrade it at any time," we divided the problem into two sets. We were going to require you to upgrade the control services atomically: that's your API nodes, your conductor nodes, your scheduler, that kind of thing. 
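The serialize-and-rehydrate behavior can be illustrated with a toy serializer (all names invented, not the real Nova serializer): the wire format records which object it is, so the receiving side can rebuild a typed object rather than a bare dictionary, and field types are enforced at the boundary, which is exactly the "ID as a string blows up on Postgres" case just mentioned.

```python
REGISTRY = {}

def register(cls):
    """Track known object classes so primitives can be rehydrated."""
    REGISTRY[cls.__name__] = cls
    return cls

@register
class Instance:
    fields = {'id': int, 'uuid': str}

    def __init__(self, **kwargs):
        for name, ftype in self.fields.items():
            # Coerce/check field types at the boundary, so an 'id' of
            # "42" becomes an int before any backend sees it.
            setattr(self, name, ftype(kwargs[name]))

    def to_primitive(self):
        # The wire format carries the object's name alongside its data.
        return {'obj.name': type(self).__name__,
                'obj.data': {f: getattr(self, f) for f in self.fields}}

def deserialize(primitive):
    # Rehydrate the plain dict back into a rich, typed object.
    cls = REGISTRY[primitive['obj.name']]
    return cls(**primitive['obj.data'])

wire = Instance(id='42', uuid='abc').to_primitive()  # id coerced to int
obj = deserialize(wire)
```

The receiving end gets a real Instance with an integer id, instead of a forever-JSON dictionary whose field types are whatever the sender happened to put in.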
Hopefully those are the things you have fewer of, and those have to go with the database schema. The database schema upgrade, we know, is painful and long-lived, and so that still has to happen, and it has to happen in conjunction with these other services. However, you can roll all of that stuff up to Icehouse by itself and leave all of your thousands and thousands of compute nodes alone, so they can still be running. Thank you. They can still be running Havana code. Yes, thank you. So I promised I'd pay her; she did that. So this is a major improvement, I think, because you've got all these compute nodes, and they can all mostly stay the same. You can upgrade your database and your control services, get them all up to the level of Icehouse, and then you can start picking off your compute nodes as it's convenient and upgrading those to Icehouse as well, until everything is done. So this is not the unicorn, but it's getting there. It's a step, and I think it massively reduces the size of the problem. In order to make sure that this worked, I mean, this sounds great, but we really wanted to have some strong verification that, A, this worked, and, B, it didn't work on Tuesday and then break on Wednesday, because it is actually really quite easy to break. So toward the end of the cycle, we had a new job start up in the upstream gate CI system that would stand up a DevStack on Havana and run the Tempest smoke tests against it, to make sure there's some data in the database and that it is a running, functional Havana setup. It would leave the compute node running, shoot all of the other services in the head, bring them up to Icehouse code, run the database migrations, and start all the services back up, so you ended up with another DevStack running where everything was Icehouse except the compute node. Then it would run the Tempest smoke tests again and make sure that it still worked. So we did that, and we found some lingering issues. 
Once it started working, we promoted it to voting. So now, try as you might, if you go and introduce some patch to Nova that breaks the upgrade cycle for this well-defined procedure between Havana and Icehouse, the gate system will kick your patch out and say you're not allowed to do that. Obviously there's a lot more we could do for that verification process, but at the very least it's a pretty well-defined box of functionality that we test on every single patch, to make sure that whatever you're proposing is upgradeable from Havana without having to restart your compute node. So in Icehouse, this is what you've got. You atomically upgrade your database schema and all your control services to the Icehouse code, and you leave all of your Havana compute nodes alone. Then, when messages come from the newer API nodes through the queue and they hit a Havana compute host, they're going to have a version number attached to some of these objects that is the version number from Icehouse. The object infrastructure that sits around the compute node now looks at the version and says, well, I don't know about version 1.7 of the instance, I only know about version 1.4, and that architecture automatically kicks the object out to your conductor, which you've already upgraded, and the conductor can backport the instance object to the version that you say you know. So this happens automatically, and as soon as you get your deployment to this state, where all of your compute nodes are old and all of your control services are new, you're taking this indirection penalty for every message that's going across. However, as soon as you upgrade one of your compute hosts to Icehouse, suddenly it's fine with the object version that it's being sent, it stops kicking things out to conductor, and you stop taking this indirection penalty. 
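That check-the-version-and-backport flow might look roughly like this toy model (hypothetical names, fields, and version numbers; the string comparison is only safe for these single-digit toy versions): an old node that receives a too-new object pays one round trip to the conductor, while an already-upgraded node pays nothing.

```python
def conductor_backport(primitive, target_version):
    # The conductor runs the newest code, so it knows how to downgrade
    # an object to whatever version an older node asks for.
    data = dict(primitive['data'])
    if target_version == '1.4':
        data.pop('new_field', None)  # pretend this field arrived after 1.4
    return {'version': target_version, 'data': data}

class ComputeNode:
    def __init__(self, known_version):
        self.known_version = known_version
        self.backports_requested = 0   # counts the indirection penalty

    def receive(self, primitive):
        # Toy version check; real code compares versions properly.
        if primitive['version'] > self.known_version:
            self.backports_requested += 1   # round trip to the conductor
            primitive = conductor_backport(primitive, self.known_version)
        return primitive

msg = {'version': '1.7', 'data': {'uuid': 'abc', 'new_field': 1}}
havana = ComputeNode('1.4')    # not yet upgraded: needs backports
icehouse = ComputeNode('1.7')  # already upgraded: no penalty
old_view = havana.receive(dict(msg))
new_view = icehouse.receive(dict(msg))
```

The penalty disappears per node: the moment a compute host is upgraded and recognizes the incoming version, its receive path stops calling out to the conductor at all.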
So this happens for each individual node independently, and once they're all upgraded, you stop having to do this little bit of indirection. The really nice thing about this is that you may want to take some downtime on your actual compute nodes while you're doing this big upgrade, right? You're going to be rolling forward from Havana to Icehouse on all your compute nodes, but you want to take one down, get all of the instances off of it, live-migrate them all off to another machine, take that one down, bump the kernel, maybe do hardware maintenance or whatever. So the migration messages that go between the compute nodes to move instances around will honor this as well. You have a newer Icehouse node and an older Havana node, and when you try to migrate an instance from the older to the newer, they have to speak, right? And the conductor will do the same object translation for the node-to-node communication as it will for anything else. So as part of this Havana-to-Icehouse migration path that we've got, you can actually cold- and live-migrate your instances from Havana nodes to Icehouse nodes as part of your upgrade. So you can go through, add a couple of clean Icehouse nodes, migrate or evacuate a whole Havana node to an Icehouse node cleanly, then shut the Havana node down, do whatever you need to do for your upgrades, your kernel reboot and everything, boot it back up as an Icehouse node, and then you can evacuate another node to it. So this is a pretty, you like that too? Sweet. Yeah. Thanks. This is a pretty big improvement as well, because you really get away from the point of having to bring down instances even to do hardware maintenance and kernel reboots and that kind of stuff. So, like I said, when your last compute node gets its upgrade to Icehouse, it stops taking this indirection penalty, and then your overhead goes to zero. So you're all the way back to a regular deployment. 
There's really not much you have to do to get back to this level; the indirection just stops happening as soon as the compute nodes are receiving objects that they know about. So that's the Icehouse upgrade path. Obviously we're not done, because I put an X through the unicorn, and we can't stop until we get there. We've mostly baked these objects into the compute node, because that's the big thing that we really care about: making sure that we can isolate the compute nodes from the upgrade process. But obviously we want to break down the group of things that you have to do atomically into as small pieces as possible, right? And right now, everything except the compute node has to be done in conjunction with the database schema, which is not ideal. So continuing to push the objectification of the other services will help with this. Nova API would be a nice thing to get fully upgraded to objects, because then we could decouple the API service from the database schema. If we kick the scheduler out of Nova, which we do a lot, we kick things out of Nova constantly, and I'm really hoping to kick the scheduler out, then we'll get some sort of stable API between Nova and the scheduler, which will help prevent the scheduler from being too dependent on what our database schema is. So that will reduce the scope of the things that have to go atomically to just the conductor and the database schema. Ideally we would break that apart as well. There are a couple of different ways to do this. I would really like to be able to upgrade the Nova code first and have it be tolerant of both the old schema and the one that it expects. I think that our use of SQLAlchemy might make that a little bit more complicated, so we may end up having to do the opposite, which is to change the way we write our database migrations to be only additive for the whole cycle, and then have a migration at the end of the cycle that drops all of the cruft that we're not using anymore. 
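The additive, expand-then-contract idea under discussion can be sketched with sqlite3 standing in for the real database, and hand-written SQL in place of sqlalchemy-migrate or Alembic; the table and column names here are made up. The key property is that the expand step is safe to run underneath old code, because old code never touches the new column.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE instances (uuid TEXT, hostname TEXT)')
conn.execute("INSERT INTO instances VALUES ('abc', 'node1')")

# Expand phase: runs while old code is still live. Adding a nullable
# column is safe because old code never reads or writes it.
conn.execute('ALTER TABLE instances ADD COLUMN node TEXT')

# Old code keeps working unmodified against the widened schema...
row = conn.execute('SELECT uuid, hostname FROM instances').fetchone()

# ...while new code (the object layer) migrates data on demand,
# falling back to the old column until rows get rewritten.
uuid, node = conn.execute(
    'SELECT uuid, COALESCE(node, hostname) FROM instances').fetchone()

# The contract phase (dropping 'hostname') waits until the next cycle,
# once nothing anywhere still reads the old column.
```

Because the schema change and the code change are decoupled like this, the schema migration no longer has to happen atomically with restarting every service that reads the table.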
That would allow us to do the schema upgrades in parallel with regular operation. And then, when you start to upgrade the conductor node, the object layer will do the data migrations on demand, once the schema has been upgraded to have these new columns that we need, and the objects layer provides us the isolation to do that independent of the rest of the code that might need to know about it. The problem is we don't have a really good way to do that with sqlalchemy-migrate right now. So moving to Alembic might provide us the ability to have an additive and a subtractive migration path. Alembic, it's a schema migration tool. So anyway, that's something that we're looking forward to: being able to move the database migrations completely out of the code path and have the code be gracefully upgradeable in individual silos. And then, of course, the more complicated we make this, the more diligent we have to be about testing it, to make sure we don't break it, because I can break it all day long. So we have this kind of single-node setup right now, where we have a single compute node, we upgrade the rest of the services, and we make sure that works. We don't actually run a full Tempest run against that, so there probably are complicated functions that wouldn't work in that upgrade scenario that we don't know about, so moving toward running a full Tempest run against that would be good. The problem is that whole process takes a really long time, and it really sucks to be the slowest guy in the gate. So it's kind of a big deal to run a full Tempest run, then do your upgrade, then run another full Tempest run; if you've got a two-hour-long job, it's just not very good. So we've got to figure out how to do that more quickly and expand the test coverage. And then the other thing is, it's not super realistic if you don't have multiple nodes, right? 
If you don't have an upgraded compute node and a not-upgraded compute node all talking to upgraded infrastructure. So we really need to get multi-host testing in the gate done, so that we can actually have this kind of very split deployment, where we've upgraded our conductors but not our API nodes, and two of the three compute nodes, or whatever. So the TripleO guys, I think, are probably going to end up helping us out with this. They're working on some gating that does kind of real deployments, and would do that across multiple hosts, and hopefully let us simulate a more realistic environment where you've got more than one compute host, because, aside from on my laptop, single-node clouds are kind of boring. Yeah, complicated, huh? I really haven't looked at doing this for any of the other services. So from a Nova perspective, if you're using nova-network, all of the things that I just described are true. If you're using Neutron, then you've got whatever their upgrade requirement is, which I believe is that the agents that run on the compute hosts have to go down with the Neutron server and come back up. But ideally, once we get this figured out for Nova, anything else that runs on compute hosts should follow a similar strategy, so that we can avoid having to take down all of the Neutron components at a time. This is really specifically just for Nova at the moment. Yep, probably. Sorry, I'm still listening to the question. Go ahead, say it again. It could. The question was: can you hot-patch the code that's running on the compute host to understand the newer version of the object, so that you can stop taking the indirection hit? I mean, at that point you've restarted the service, so we could talk about backporting the object layer to Havana, so that you could take a small upgrade on all of your older hosts that runs newer object code, but no changed Nova code, that would potentially be able to tolerate the new version of the objects. 
But we don't have a versioned interface between the objects layer and the stuff that sits above it, right? So we would have to change a little bit about how we do that, but we could. So the thing that is usually the biggest performance hit on the conductor nodes is the fact that they are doing a lot of database work for you, and they're mostly blocking when they talk to MySQL, because of the way we use eventlet and our MySQL driver. So I think that in order to sustain the performance that you need for your cloud, you've already got them pretty well scaled out, and the hit that they take when you do this indirection is purely CPU. So I think that you probably are going to have conductor hosts over-specced for this particular activity anyway. I'm not sure if that makes sense, but it's the least complicated thing that a conductor node has to do. What's that, sorry? You certainly do. So yeah, I mean, obviously this is the first release where we supported this functionality, right? So please tell me how it goes. But potentially you could generate a lot more load on your conductor. Yeah, you don't get any of this if you use local conductor, right? Yep, obviously, because you haven't upgraded your layer, yeah. That was the last slide, so if anybody else, yeah. Okay. Oh, look at her waiting at the microphone. I am, all nice and tidy. No, it's all good. Really, around the schema independence with the conductor: so we get to where we're using conductor, we have this great breakout, and then Nova's going to be looking at these schema options. Where are you all in that conversation? Are those things that are going to happen this week during the design summit? Yeah, so I've got a compute track session tomorrow or the next day to specifically talk about what we need to do for the next cycle. 
And I would like to propose that we try to be only additive with our schema migrations for the next cycle, so that we can try to have some testing of applying the schema changes underneath currently running code. I haven't actually looked at the migrations that we may already have in the Juno tree, so we might have already broken that, but absolutely, that's one of the next things we need to talk about. Okay, so that would just be a good one for me. I just came from one of the operator meetups, and I want to make sure there are at least a few of us that are doing more of the deployment and the operations of it to see how that is going to impact us. Yeah, I know it'd be horrible to take away the fun for the operators of applying those migrations, but. I know, it would be worse to have to do them all at the very, very end. Oh, well, no, I'm not saying. Even just the cleanup ones, that's just kind of what I want to be part of the conversation. Well, the cleanup ones you would be able to run whenever you want, right? So if it's purely additive, it would be the kind of thing where, and this is the problem with our current schema mechanism, we have this linear version number, and so we don't have this really nice ability for you to insert a migration into the path at some point that doesn't affect anything. But ideally we would make it so that you can apply pieces or all of the subtractive migration whenever you want, if we get the code to the point where it's doing that migration for you. So yeah. So that makes me feel better, yay. And then, has anybody that you know of actually done this yet, on, like, a big, real cloud? Yeah. Thanks for bringing that up, no. Well, no, I'm just saying I've made this. It's only been like three weeks that it's been out, so. What's that? Oh, there you go. I know what we're gonna be doing the rest of this month now that we're here. Yeah, so we'll let you know. I'll go on vacation or something. Yeah, I'm sure you will. Yeah. 
Well, now we've got the gate job, so Jenkins will just smack you if you break it. But the one thing that we would really love is for people that are adding other features that aren't necessarily objects- or upgrade-related to look at it from the upgrade perspective, right? If you are doing anything that talks to the database, or uses any data out of the database, but that isn't using an object mechanism to do that, it would be really great if you could prefix your patch with something to convert the old style to the new style, and then make your code use objects instead of the raw database API. That just really, really helps make sure that more code ends up with this isolation layer between it and the schema. So that's probably the easiest thing you can do: if you're adding code, or proposing adding code, that talks to the database API at all, don't, and use the object API instead. That would be perfect. Is there a version of the object? Or is it straightforward? Most of the objects are created, so if you are looking at instances or aggregates or migrations, if you go into the objects directory, there is a thing for that, and what's nice is that the database API is like 500 raw functions, while the objects are all broken out by thing. So if you go look at the instance object, just the instance-related methods that you can do are documented there, so it's actually a little bit easier to find. It's just that most people are used to looking at the database API. Anyone but Josh? No, it's okay. Go for it. Yeah. Yeah, Ironic does. Yeah. I haven't been pushing, and some other people have. It would be nice, because Ironic and Nova have diverged a little bit, and that's the point of Oslo, right? But yeah, are you volunteering? It's definitely something that would be good. Yeah. Well, yeah, having a pattern for each of the services that you know to follow, I'm sure the operators would appreciate that, instead of something totally different for each one. 
We've already got a different db sync command for every project, so that's probably the ironic part, right? So, I think there's like two more minutes if there are other questions. Meaning, predefined sequences? Yeah. Well, that's like what Josh was saying: having the same process for each of the things to hopefully make it work would be great. Yeah. Really? Cool. I don't know that I would know, but I mostly ignore those guys; they're kind of weird. All right, I guess that's it.