All right, thank you guys for joining us, and welcome. I'm Preeti Desai. I'm an OpenStack evangelist and work on various OpenStack projects. I'm a Keystone developer, and I also work on the OpenStack Security project, which was formerly known as the OpenStack Security Group. And here I have my colleague Gabe.

Hi, I'm Gabriel Capisizou. I'm a cloud infrastructure engineer, part of Symantec's Cloud Platform Engineering, and I mostly do operations and take care of our cloud setups.

Thank you, Gabe. So in the next 30 to 40 minutes, we are going to find out how we upgraded most of the OpenStack core services. And what is the magic behind 10 minutes? Why not five minutes, and why not 15 minutes? In this presentation, we'll go over the upgrade plan, the step-by-step process of what we did, and the lessons learned.

After the last summit in Paris, we went back and realized that it was time to upgrade. We were running Havana and wanted to upgrade most of our OpenStack services to Icehouse. There were a few key drivers for our upgrade plan. First, what are the new features, and what benefits are we getting out of the new release? We made a list of these features so that we wouldn't enable features that aren't needed, beyond what comes enabled by default with the services. Next, once we upgrade our OpenStack cloud to Icehouse, if things go wrong, is there an easy rollback? Can we roll back to Havana and keep things moving smoothly? Next, how can we upgrade with minimal downtime and minimal impact on our customers? And since we have multiple OpenStack clouds running, how do we apply this upgrade to all of our environments? Automation and everything will be covered later on.

In this upgrade plan, we upgraded most of the core OpenStack services: Keystone, Nova, Glance. Horizon is not listed here, but it was upgraded as well as part of this plan. For each of the OpenStack services, we went through and looked at the configuration changes. What are the deprecated options? We want to make sure we don't use those options anymore (a small sketch of auditing for these follows below). What are the new options in the configuration file? We made sure not to enable them by default; if we need a feature we can enable it later, but not while upgrading the entire OpenStack cloud. What are the schema changes, and how will my database look after the upgrade is done? Are there any unknowns? We had better know how our schema will look after the upgrade. And also, thoroughly test, test, and test: functional tests of each of the services, and then integration tests with all of the services together.
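To make that configuration review concrete, here is a minimal sketch, not from the talk, of checking a service's config file against a hand-maintained list of deprecated options. The entries in DEPRECATED are illustrative placeholders; build the real list from the release notes of the release you're upgrading to.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag deprecated options in an OpenStack config file.

The DEPRECATED map is illustrative; populate it from the target
release's notes. sql_connection is a real example of an option that
moved (to [database]/connection) around the Icehouse era.
"""
import configparser
import sys

# section -> option names deprecated in the target release (placeholders)
DEPRECATED = {
    "DEFAULT": {"sql_connection"},
    "keystone_authtoken": {"auth_host"},  # hypothetical entry for illustration
}

def audit(path):
    conf = configparser.ConfigParser(strict=False, interpolation=None)
    conf.read(path)
    findings = []
    for section, options in DEPRECATED.items():
        # has_option() raises on missing non-DEFAULT sections, so guard first
        if section != "DEFAULT" and not conf.has_section(section):
            continue
        findings.extend((section, opt) for opt in options
                        if conf.has_option(section, opt))
    return findings

if __name__ == "__main__":
    for section, opt in audit(sys.argv[1]):
        print(f"deprecated option still set: [{section}] {opt}")
```

Something like `python3 audit.py /etc/nova/nova.conf` could then be run against each service's config before writing the new deployment data.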
So those are the key drivers, but how did we actually do the deployment and the upgrade? Gabe is going to talk about it.

Yeah, so starting with this next slide, we're going to get into more detail on how we actually performed the upgrade. As we said before, we use Puppet to deploy our OpenStack infrastructure, and we have everything in Git. Whenever we start working on an upgrade plan for a new release, we start testing that Puppet code. We use Hiera for everything, so our code has no hardcoding, because everything is described in Hiera. Also, one thing we're doing differently is that we're not using a Puppet infrastructure with Puppet masters. We started trying a masterless Puppet setup, and we use orchestration with it. One of the main advantages is that it allows us to deploy, for example, compute nodes by running Puppet code on hundreds and hundreds of boxes at the same time. For instance, at one point during this upgrade we had to run Puppet on 600 boxes at the same time, and that happened very, very quickly. There was no problem with load on the Puppet infrastructure, because there is no Puppet infrastructure. (A rough sketch of this kind of parallel run follows below.)
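The talk doesn't show the orchestration itself, so here is a minimal sketch of the idea, assuming SSH access and that each node already has the Git-managed modules and Hiera data checked out locally; the inventory, paths, and worker count are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch: fan masterless `puppet apply` out over SSH.

Inventory, module paths, and concurrency are placeholders; assumes
each node has the Puppet code and Hiera data already checked out.
"""
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"compute{n:03d}.example.com" for n in range(600)]  # placeholder inventory
CMD = ("sudo puppet apply --detailed-exitcodes "
       "--modulepath /etc/puppet/modules /etc/puppet/manifests/site.pp")

def run(node):
    try:
        # BatchMode avoids hanging on a password prompt during a mass run.
        r = subprocess.run(["ssh", "-o", "BatchMode=yes", node, CMD],
                           capture_output=True, text=True, timeout=1800)
        return node, r.returncode
    except subprocess.TimeoutExpired:
        return node, -1

with ThreadPoolExecutor(max_workers=50) as pool:
    for node, rc in pool.map(run, NODES):
        # with --detailed-exitcodes, 0 and 2 both mean a successful run
        status = "ok" if rc in (0, 2) else f"FAILED (rc={rc})"
        print(f"{node}: {status}")
```

Because there is no Puppet master, the only bottleneck in a run like this is SSH fan-out, which is why hundreds of nodes at once is not a problem.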
We also had an upgrade matrix, and this is something you might want to do: we wrote down what version each component was at before and after the upgrade. Ever since this upgrade, and we'll go over that later, we have upgraded other components again, and this migration matrix is going to help us moving forward as well.

So let's get into more detail on the methodology of the upgrade. As my colleague said, the main idea is to minimize the control plane downtime. And it's very important to note that there was no data plane downtime; we didn't have any downtime for our customers. We minimized the control plane downtime by setting up a separate control plane, using it to test everything, and then, when it was good to go, just flipping everything over to the new control plane.

Next, I was going to talk about the importance of having an upgrade plan. In our case, we had a simplified and a detailed upgrade plan. I put the upgrade plan here; we're not going to spend much time on it, but it was the actual upgrade plan. The reason for having two is that sometimes people are very interested to know what's happening with the upgrade, and people get scared. If you give them a very detailed plan, they'll be more scared, so it's easier to tell them what they need to know, and the people who actually want to know more can have the detailed version. And even this detailed upgrade plan is not complete: the real one had another column where each and every step has a start time and an end time, another column with the actual commands to execute, and also the resources assigned to each step. You want to make sure everything goes smoothly.

Next, I used to have a picture which didn't quite make sense, so I had to redo the whole slide, and this is the first time I'm doing animations in my slides. This is what we started with: a simplified diagram with three cloud controllers, the database, the load balancer in the middle, and then our compute nodes.

The first step of the upgrade was to build new cloud controllers. We're using a fully virtualized control plane, so this was actually a very quick step. The second step was to find a test compute node. There are two ways of doing this: one is to build one from scratch; in our case, it was easier to just pick one of the existing compute nodes, disable it with nova-manage, and move it over to the test side. That was a compute node with some VMs we could afford to play with. The next step was to build a new database. In our case, we actually wanted to do some upgrades there, so we built a new database cluster for the new control plane. The next step was to sync the databases over from the old control plane. And the next step, which we'll go over in detail later, was to create fake VIPs on the load balancer: the existing VIPs were copied over, and then we replaced the nodes behind these fake VIPs with the new nodes from the new control plane. So we connected everything, and part of this, which we'll also go over, is adding host entries on these boxes to point to the fake VIPs.

The next step was to actually go and deploy OpenStack on the new cloud controllers, and also upgrade Nova on the test compute node. That was a fairly easy task. Once everything was running, we started testing. This is the part where we ran tests and tests and made sure everything worked OK. Everything up to this point can be done as many times as you want; you can work on it for one hour or for one week, however long you need, and there will be no impact on your running production environment.

Now we move to the actual cutover. When you decide everything looks good, the downtime of the control plane starts when you flip the VIPs, so everything goes through the existing VIPs to the new control plane, and you disconnect the old cloud controllers. One other thing you need to do is re-sync the database, because if a lot of time has passed, things have obviously changed in the database. The next and last step is to upgrade the compute nodes. And the last action is to do some cleanup. You might want to preserve the old controllers for a while, because sometimes things don't go perfectly: maybe you forgot to save something, or not everything was in config management and you need to look at some file you still have there. The same goes for the old database; you might want to preserve that as well. But last and not least, don't forget to clean up the fake VIPs you created on the load balancer. Otherwise, people will find them after a while, nobody will know what they are, nobody will touch them, and they just add up in the config.

Now some more details on each component of the upgrade. Creating a new control plane with new cloud controllers also allows you to do OS upgrades; for example, if you were running a very old OS, this is the time to do everything from scratch and install a brand new OS. I should also mention this plan works regardless of where you run: in our case, everything is virtualized, but some people run the whole control plane on bare metal, and the plan works fine on bare metal too, so everything applies. The same goes for the load balancer when we mirror the control plane: everything works with either hardware load balancers or HAProxy; it's the exact same thing.

One of the things we do is terminate all endpoints with SSL/TLS, so everything runs over HTTPS, and we have a whole set of ports for the different components. If you can have different IP addresses for your endpoints, and you want to use TLS, it's probably easier to just get rid of the ports and use names and DNS for these components.

The other thing is about changing host entries. It's a funny story: when we were testing this for the first time, at some point we had a problem where our fake control plane wouldn't connect to the fake VIPs, and nobody knew why. Eventually we got onto the box, looked at the hosts file, and saw that we had put the entries as name and then IP. So make sure you use the right format: IP first, then name. We learned this the hard way; nobody would have thought that was the problem. (A small sketch of this follows below.)
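As a concrete version of that lesson, here's a small sketch, our own illustration with placeholder addresses and names, that appends hosts entries in the correct order, IP first and then name, and tags them so the cleanup step can find them later.

```python
#!/usr/bin/env python3
"""Minimal sketch: point endpoint names at the fake VIPs via /etc/hosts.

Addresses and hostnames are placeholders; run as root. The format
is IP first, then name -- getting this backwards fails confusingly.
"""

FAKE_VIPS = {
    "10.0.0.101": "keystone.example.com",
    "10.0.0.102": "nova.example.com",
    "10.0.0.103": "glance.example.com",
}

MARKER = "# upgrade-test-vip"  # tag entries so cleanup can find them later

with open("/etc/hosts", "a") as hosts:
    for ip, name in FAKE_VIPS.items():
        hosts.write(f"{ip}\t{name}\t{MARKER}\n")

# If nscd is running, flush its cached host entries afterwards:
#   nscd -i hosts    (or restart the nscd service)
```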
Also remember that if you use nscd, it needs to be flushed or restarted to get rid of the cached host entries.

For the database: obviously you should back up your database, and back it up elsewhere, not on the same box. And the same choice applies as with the controllers: you can build a new cluster, or, now that we have built a new cluster, we'll just reuse it next time. We do that by using database names that differ from the defaults. For example, instead of nova or keystone or glance, we use nova_<release name>. That will allow us to apply this plan at the next upgrade by just creating a new database, we'll call it nova_kilo, and reusing everything else.

For the schema upgrades, keep in mind that each component has slightly different commands for upgrading the schema. You need to either script it or know exactly what's going to happen; otherwise you'll end up in the middle of the upgrade reading man pages, and you don't want to do that (a sketch of such a script follows below). Also, UTF-8: previously the databases used latin1 encoding, and now UTF-8 is used, so that's something you want to take care of before running a db sync. And the importance of timing, we went over that already: it's really important to know when to do the database operations in the bigger picture.
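To illustrate the "script it beforehand" advice, here's a rough sketch of those database steps under some assumptions not in the talk: MySQL with client credentials already available, Icehouse-era command names, and placeholder per-release database names.

```python
#!/usr/bin/env python3
"""Rough sketch: scripted schema upgrades (Icehouse-era commands).

Run only after backing the databases up elsewhere. Database names
are the per-release placeholders described above; assumes MySQL
client credentials are available (e.g. via ~/.my.cnf).
"""
import subprocess

DATABASES = ["keystone_icehouse", "nova_icehouse", "glance_icehouse"]

def sh(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the upgrade on the first failure

# 1. Move default encodings from latin1 to UTF-8 before any db sync.
#    (This sets the default for new tables; existing tables also need
#    ALTER TABLE ... CONVERT TO CHARACTER SET utf8.)
for db in DATABASES:
    sh(["mysql", "-e",
        f"ALTER DATABASE {db} CHARACTER SET utf8 COLLATE utf8_general_ci;"])

# 2. Each component has a slightly different schema-sync command.
sh(["keystone-manage", "db_sync"])
sh(["glance-manage", "db_sync"])
sh(["nova-manage", "db", "sync"])
```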
I'm just going to talk a little bit about validation. We had set up a validation environment, a test environment, before starting anything. We ran some tests to get a baseline, and we used this test environment before, during, and after the upgrade: during the upgrade, by repointing the test environment at the fake control plane, which allowed us to test everything without any impact on production, while everything was still up and running. And obviously, after the upgrade was done, we removed the host entries from the test environment and re-ran the tests.

Next, the cutover time. It's really important to set expectations with your customers and communicate properly about what's going to happen. One of the things we had to do was make sure our customers understood what the upgrade meant for them. Everybody got scared when they heard we were doing an OpenStack upgrade, and when we told them their VMs would keep running and nothing would go down for them, they were confused: what does that mean for me? We told them it just means you can't manage your cloud during the upgrade, and then they were all good. We had good feedback about this plan. During the cutover, it's very important to have everything detailed for the people executing the plan. If something doesn't work, you want to make changes for the next time you do it; the more detailed the plan is, the better it will be next time, because you can have anyone execute it. You don't have to have your best engineers execute the plan; you can have them doing something else, and anybody can execute it. So again, people are assigned to the different tasks of the cutover.

On the next slide, you might ask where the 10 minutes of downtime came from. Again, this is our experience, and you might have a different one; this is something we have done and it worked for us, and it might take longer or shorter for you. As you can see, the first two minutes of the cutover, the downtime of the control plane, were spent shutting down nova-compute on all our compute nodes. At the same time, we stopped the components and set the database to read-only. Then, when everything was shut down, we ran a db sync at the same time as doing the load balancer work and flipping the VIPs to the new ones. In the middle of this is the last checkpoint before upgrading the compute nodes, where you might want to retest everything. Probably the longest step is upgrading your compute nodes, which is going to take some time. And like I said previously, when we had a Puppet infrastructure with Puppet servers, we sometimes ran into issues when running Puppet at the exact same time on hundreds or maybe thousands of nodes; that was not easy for the Puppet servers to take. We don't have that problem anymore, so it allows us to do it very quickly. And the last step is to start the components on the new control plane.

Again, after the upgrade, we want to do validation and testing. We talked about having the validation and test environment previously set up, and you should have everybody on your team use it to test the upgrade. But it's also important to work with your customers, because sometimes there are changes: there might be API changes, or people may have used a specific feature that is a little bit different in the new release. If there are issues, you might want to have resources assigned and work closely with your customers to make sure they're going to be OK after the upgrade.

And a few things we have learned since this upgrade. Like my colleague said, we're trying to stay six months behind the latest code. One of the things we have started doing is upgrading individual components; our colleagues gave a related presentation yesterday about how to do that and what the driver for it was. To give an example, we have already upgraded our Keystone component to Juno, and we're getting ready to upgrade to Kilo. A lot of what we have presented here applied to that upgrade too. The way we did it was to pull out or shut down the existing Keystone component on our cloud controllers, build new Keystone servers with the newer release, and swap things over in the load balancer. We used the same methodology of using a different database for the new Keystone install. And we pretty much applied the same pattern of not taking anything down until everything is ready to move over; when everything is tested and ready to go, we just flip it over.

About Keystone and the database, one thing I forgot to mention, and I was actually curious, if you can raise your hand: how many of you are using PKI tokens for your Keystone deployments? Anybody using PKI tokens? Nobody? Yeah, so that was one of the advantages we had: when we switched to using PKI tokens, our Keystone database performance improved a lot, and it made this upgrade easier.

We're also looking into using containers for deploying the control plane, and into other upgrade patterns. So I'm done a bit early with this presentation because I wanted to leave some time for questions. If you have any questions, if you want to speak into the microphone, we'll try our best to answer.

Did you have any underlying library stuff, like QEMU or things like that, to do as part of the upgrade process, or no?

Do we have what?
Any underlying library upgrades as well, like QEMU?

No, we didn't do that, but that's something we're going to do for the next upgrade, actually.

Hi, I have a question about rolling upgrades and the possibility of decoupling the control plane upgrade for Nova and Neutron from the compute nodes and the agents running there. Is that a possibility, and did you consider it? I can think of multiple reasons why that would be interesting to me: upgrade the control plane, but not the computes, and just gradually upgrade them as you want.

Right, right. That's exactly what we have started doing now. Like I said, we've already upgraded Keystone to Juno, so we're running a Juno Keystone, and we're almost ready to upgrade Glance to a newer release as well. It is possible to upgrade nova-api, nova-scheduler, and nova-conductor but leave nova-compute alone; that is something I understand is possible. We haven't had the time to run and test it, but my understanding is that you can upgrade the scheduler and the API while keeping an older version of nova-compute. It's something we want to test as well. It would be interesting to try.

In your presentation you mentioned you are using Keystone Juno already. As far as I know, the DB migration from the previous version to Juno combined three metadata tables into one, so that would be huge, and the data migration takes longer. In your approach, you set up the new tables and do the migration beforehand, and during the short downtime you re-sync the old database into the new one. So how do you make sure that process stays short?

One of the things that made the database migration shorter for Keystone was the lack of tokens in the database, which used to take up a really big space. It was really quick when we migrated the schema from Icehouse to Juno; it didn't take a long time. I think it was a minute, if I'm not mistaken; the sync took less than a minute from Icehouse to Juno on our Keystone node.

OK, so you didn't have to copy the token table?

There are no tokens in the Keystone database, because we're using PKI, and PKI tokens are signed by Keystone and validated by the individual components, so there are no tokens in the database.

OK, I see. So if they're using PKI to do validation on the server side, there will be an...

Right, right. So there is the Keystone middleware, which validates, let's say, whether the token is revoked, or whether a user is disabled while the token is still valid. The revocation check goes against the revocation events, which we have enabled. So in Juno there's the auth middleware and the revocation events, and it doesn't have to go to the Keystone database.

I see. OK, cool, thanks.

So if there are no other questions, we thank you for coming to our presentation. I hope upgrading is less scary for you now than it used to be.
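Since PKI tokens came up several times in the Q&A, here is a minimal sketch, not from the talk, of what switching a Juno-era Keystone to the PKI token provider might look like. The config path, provider string, user/group names, and restart command are all assumptions for a stock install of that era.

```python
#!/usr/bin/env python3
"""Minimal sketch: switch a Juno-era Keystone to PKI tokens.

All paths, names, and the restart command are assumptions for a
stock install; configparser rewrites the file without its comments,
so treat this as illustrative rather than drop-in.
"""
import configparser
import subprocess

CONF = "/etc/keystone/keystone.conf"  # assumed location

conf = configparser.ConfigParser(strict=False, interpolation=None)
conf.read(CONF)

if not conf.has_section("token"):
    conf.add_section("token")
# Juno-era PKI provider; the default at the time was UUID tokens.
conf.set("token", "provider", "keystone.token.providers.pki.Provider")

with open(CONF, "w") as f:
    conf.write(f)

# Generate the certificate and key used to sign PKI tokens.
subprocess.run(
    ["keystone-manage", "pki_setup",
     "--keystone-user", "keystone", "--keystone-group", "keystone"],
    check=True,
)

# Restart Keystone to pick up the new provider (name varies by distro).
subprocess.run(["service", "keystone", "restart"], check=True)
```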