OK, let's talk about upgrades. Welcome to our talk, Rule Breaker: Upgrading an OpenStack Cloud While Skipping a Release. We, that is my colleague Rick Zalewski and me, Nano Kriner, are both cloud engineers working on the SUSE OpenStack Cloud product. We will start with some general remarks about upgrades and briefly discuss some common upgrade strategies. Then we will come to the more interesting part, where we describe how we did an upgrade while skipping an OpenStack release. At the end we will finish with a short outlook on where we want to go next. So, who in this room has actually already done an OpenStack upgrade? Quite a few. It is an important topic, and there are good reasons to upgrade: you get access to security fixes and to the stability and performance improvements that were added to the projects during the development cycle; some people want to closely follow upstream development and have access to new features; and of course it is always good to stay on a supported release. But probably not that many people have enjoyed the upgrade experience so far, because an upgrade always brings problems. Downtime is hard to avoid and always an issue. An upgrade needs a lot of preparation, starting with writing announcements and coordinating all interested parties. A lot of testing is required. After the upgrade, existing workflows probably need to be adjusted. And with new code there is always the chance of new bugs in the deployment; in the worst case there can even be data loss or corrupted databases. That is why upgrades are not always fun. On the product side, when we talk to our customers about upgrades, the same topics come up again and again. They place an even stronger emphasis on reducing the downtime.
In the best case they would like a live upgrade, and the possibility to roll back during or after an upgrade, especially if it failed; that is possible already now, but it is not easy to do. Customers also want clear documentation of what happens during an upgrade, and that needs to be created. And quite a few people would prefer not to upgrade to every release, but to skip one or more if possible. So, as outlined on the previous slides, what we see out there is something we call the upgrade marathon. When a new release comes out, it is first evaluated by all the interested parties. Then a phase starts where the upgrade is planned and tested, which can go through many iterations. At some point the upgrade is finally executed, and after that there are usually still adjustments to be made for things that were missed during testing, and new features need to be integrated if they are to be used. With the six-month release cycle, the next release is often already out, and you could start over again with everything. So at some point people are tempted to just not do the upgrade this time. And as we can see on this graph, which we took from the recent OpenStack user survey, at any given time there are lots of old OpenStack releases still out there in production deployments. Obviously many people avoid doing upgrades. As a consequence, users have to live with unsupported deployments, manually backport patches, and delay services that depend on the OpenStack cloud. And if they finally decide to do the upgrade, they find themselves at a dead end, because the upgrade is only supported to the next OpenStack release and there is no clear upgrade path available. So let's look at some common upgrade strategies.
There is what we call the official upgrade process, as described in the operations manual, where you always upgrade to the next release. With a release cycle of six months, upgrades are required regularly, and that comes with a high maintenance cost. Unexpected changes can break the upgrade, of course; everybody has seen that. It requires a lot of manual effort because there is little automation, and users and operators have to suffer the upgrade pain regularly, which is also a drain on staffing. Another approach to minimize the upgrade problems is continuous deployment. That is quite risky, and it needs a lot of development manpower, extensive testing and, of course, debugging, because often things will not go as expected. On the upside, you are always running the latest and greatest code, and instead of big upgrades you are doing smaller incremental changes. This might work for development deployments, but in a production environment I probably would not want to do that. A cleaner approach is to start from scratch with the new release and roll out a fresh deployment. But that requires a lot of duplicated work: users or images have to be set up again, for example, depending on how you do it. On the plus side, you get rid of outdated artifacts like abandoned instances. A variant of that is to run a parallel installation, where you set up new controllers, change the API endpoints, and migrate the compute nodes after that. But there you need redundant hardware. And depending on the deployment framework, there are also lots of very scenario-specific, tool-specific, hand-tailored solutions for different upgrades. OK, so as we have seen, upgrades are not that easy to do, and customers were asking for ways to avoid them. So we started developing a way to upgrade less frequently by skipping a release.
And how we did that, I will hand over to Rick to explain. Yes, so let's start with a high-level overview. What we did is upgrade our product from Juno directly to Liberty. This is a multi-step process, with some steps on the old cloud and some steps on the newly installed cloud. We did not aim to keep the cloud functional throughout; we just wanted to upgrade, so in between the cloud is not fully functional. We also wanted to upgrade the operating system along with the OpenStack release, so that the customer can also use the newest operating system. Our idea behind the complete process was something like an orchestrated re-installation, to avoid as much downtime as possible. As I said already, we wanted to simply skip the OpenStack Kilo release and find a way to handle all the missed migrations. And also to avoid too much downtime, we wanted a configuration management system with which a lot of steps can be prepared in advance. As I said already, our upgrade still has downtime and is still destructive. Another goal was not to use extra hardware; we wanted to use the hardware the customer already has. For this there are some requirements. There needs to be an orchestration mechanism; this can be a human, a script, or a configuration management system like Chef, Puppet, Ansible, or any of the others. For our product we chose Chef. Another requirement is that the new OpenStack packages need to be already available and tested. For the upgrade itself, more disk space is needed on the controller node. This is mostly because in between we have something like a duplicate database: we dump the database, and depending on how big the database is, this can be many gigabytes, or even terabytes.
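That disk-space requirement can be checked up front, before the database is dumped. A minimal sketch of such a pre-flight check; the function name, the default safety factor of 2x, and the idea of comparing against the current data-directory size are illustrative assumptions, not part of the talk:

```python
import os
import shutil

def enough_space_for_dump(data_dir, dump_dir, factor=2.0):
    """Rough pre-flight check: return True if dump_dir has at least
    `factor` times the current on-disk size of data_dir free."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.exists(path):
                total += os.path.getsize(path)
    free = shutil.disk_usage(dump_dir).free
    return free >= total * factor
```

For example, `enough_space_for_dump("/var/lib/mysql", "/var/backups")` would have to return True before starting the dump (paths here are examples, not from the talk).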
We highly recommend that the Nova compute data is on shared storage; otherwise it can take hours, days, or even weeks to copy all the Nova compute data around. So let's start with the preparation. First, stop your configuration management system; there should be no more changes on the running cloud. Then you can start with your preparations. What you do first is update your OpenStack configurations to the new release: take your old configurations and migrate them to the new ones. This can take a lot of time, because between two releases there are a lot of changes in the configurations. So take your time and test everything. Then you should check which special migrations were needed only for the skipped release; this is also important. During all of this, your cloud is fully functional and you have no downtime. The next step is to back up all the old data. First, disable all your OpenStack services, but do not stop them yet; this is needed so they do not come back automatically when a node reboots or something else happens. Then shut down the OpenStack services on the non-database nodes. When this is done, dump the OpenStack database and save it externally. Also take backups of all important data; at any point something can go wrong, so back up everything. When you have all your backups, you can shut down your complete OpenStack environment and proceed to the next step. That next step is to set up the new OpenStack cloud. If you want to upgrade your operating system, reinstall your nodes with the new operating system and install the new OpenStack packages on the nodes. When this is done, start your configuration management system and synchronize all your configurations to the nodes. When that is also done, start your database service and restore all your backed-up data: restore the database dump into the database and mount the shared Nova compute data on the Nova compute nodes. Then comes the most difficult part: migrating the OpenStack services.
You need to run all the migrations. You can do them as documented for a single-release upgrade, but if you do some research you will see that there are some special migrations between the releases. For example, from Juno to Liberty we hit just one issue, which was in Nova, so we needed a special migration there. First, we migrated Nova to the migration level of Kilo. The last Kilo migration was 290, so we executed nova-manage db sync --version 290; then the Nova data was at the Kilo level. After this, a special command was needed to migrate the flavor data, which was required to port it from Kilo to Liberty: we executed nova-manage db migrate_flavor_data, and that took some time. The last step was to migrate Nova to the Liberty release, by simply executing nova-manage db sync, and then Nova was also at the latest migration level. When this is done, we finalize the upgrade: start all OpenStack services and check that everything is running. Your cloud should then be up with the new release and fully functional again. For sure, we had some issues. First, the configuration file migration took a lot of time, because there were so many differences between the releases. Figuring out which migrations were missed in between, and whether really everything was properly migrated, also took some time. Another big issue is that what we do is all or nothing: you cannot upgrade just one service or one node, it is simply impossible. Once you start, you need to finish; there is no way back. And this process is a predefined path: in this case it is applicable from Juno to Liberty, and it does not automatically work for every other release pair. So you should definitely make backups as often as possible, and of everything. Also, test your new configurations, to avoid typos or misconfigurations in them.
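The Nova migration just described boils down to three commands in a fixed order. A sketch that only assembles them as a dry run, without executing anything; the commands and the migration number 290 are taken from the talk, the helper function itself is an illustrative assumption:

```python
def nova_migration_commands(kilo_version=290):
    """Build the ordered nova-manage calls for the Juno-to-Liberty
    data migration described in the talk. Dry run: the commands are
    returned as argument lists, not executed."""
    return [
        # 1. bring the schema up to the last Kilo migration (290)
        ["nova-manage", "db", "sync", "--version", str(kilo_version)],
        # 2. port the flavor data from Kilo to Liberty
        ["nova-manage", "db", "migrate_flavor_data"],
        # 3. finish the sync up to the Liberty migration level
        ["nova-manage", "db", "sync"],
    ]
```

An orchestrator could feed each argument list to something like `subprocess.run` on the controller node, stopping on the first non-zero exit code.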
So far we have hit no issues with this approach. But every setup is different and every environment is different, so nothing is 100% guaranteed to work. These are a lot of steps, but you can automate them. For example, in our product we have ten steps, with three steps on the old cloud and seven on the new cloud, but only five of them are really OpenStack-related steps where the customer needs to interact. Of course there is much more happening in the background, but we can really automate most of it. Now, what we want to achieve in the future. For our next release, we plan a seamless upgrade, again skipping a release, with no downtime of important services. Not for the next release, but for the releases after that, we plan revertible upgrades. We also plan better orchestration, which is required for a seamless upgrade. To make this process a lot easier, there is work that could be done on the OpenStack side. First, configuration files should be upgradeable, ideally automatically. It would be perfect to have a tool that you run on the old configuration and whose output is the new configuration: it could show the obsolete entries, migrate them to the new ones where possible, check whether there can be security improvements, and automatically add new entries. Uniform configuration files would also be nice; for example, the keystone section in the configuration should always look the same and not be different in every project. Another good improvement would be for the migration tools and commands to stay in future releases instead of being dropped in the next release; this would avoid all the backporting and forward porting. There is a rollback option in OpenStack, but it is very difficult, so this could also be improved. What is already being worked on is non-destructive upgrades, so we recommend that every OpenStack project should also use versioned objects.
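The configuration-migration tool wished for above could be sketched with Python's configparser: given a table of renamed options and a set of obsolete ones, it rewrites an old config into the new layout and reports what was dropped. Everything here is a made-up illustration; the rename table in particular is not a real OpenStack rename list:

```python
import configparser

def migrate_config(old_text, renames, obsolete):
    """Rewrite an old-style config file.

    renames maps (section, option) -> (new_section, new_option);
    obsolete is a set of (section, option) pairs to drop.
    Returns the new ConfigParser and the list of dropped entries."""
    old = configparser.ConfigParser()
    old.read_string(old_text)
    new = configparser.ConfigParser()
    dropped = []
    for section in old.sections():
        for option, value in old.items(section):
            if (section, option) in obsolete:
                dropped.append((section, option))
                continue
            tgt_sec, tgt_opt = renames.get((section, option),
                                           (section, option))
            if not new.has_section(tgt_sec):
                new.add_section(tgt_sec)
            new.set(tgt_sec, tgt_opt, value)
    return new, dropped
```

Uniform option names across projects, as asked for above, would make such a rename table much smaller and easier to maintain.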
Versioned objects make the data model independent of the API version and the database version. Yes, so time for questions. You talked about orchestration; have you evaluated tools like maybe MCollective for doing this upgrade? For our product we use Chef, which does all the orchestration. OK. I actually had a lot of questions about this presentation, but one thing that comes to mind is that you mentioned the Nova data on shared storage. Can you explain what you mean by Nova data on shared storage? The Nova data, so all the VMs and so on, is stored on disk, and when you want to do live migration, you also need a kind of shared disk between all the Nova compute nodes. You mean the VMs run on shared storage, is that what you do? Yeah. OK. I'm just curious, what was the size of your team, what kind of timeline did you plan, and did it go longer or shorter than you expected, like one month, three months? Our team is currently, I think, between 20 and 30 people. And sure, we hit some issues, but we finished in the time we had. We are developing the solution; we are not running a production OpenStack cloud. In your testing, what was the most difficult thing that prevented a nice, seamless upgrade, like a DB schema change or some configuration format change? What is typically the most difficult thing? The most difficult part at the moment is having the old services and the new services running at the same time. We cannot run the old cloud and the new cloud on the same API endpoints, so we would need a seamless switch, which is not possible yet. And on the compute side, did you have a separate, completely new pool that you migrated to, or did you use the existing compute pool while migrating? In between, our cloud is completely offline. We have the old cloud, we back up the data, and then we restore the data into the new cloud.
So it's not about upgrading against the existing compute pool. It's not. I think it's more interesting to do that, because in some cases we don't have the facility or luxury to build out an entirely new data center. Say we already have 10,000 compute nodes and we try to upgrade OpenStack; we have to use the existing ones, right? Yeah, but that may work when you upgrade from one release to the next release without problems. When you skip releases, you have a lot more problems than when you just upgrade to the next release. OK, got it. Thank you. What was the network topology you tried out? What kind of mechanism drivers did you use on this particular system? I think we tested it with all the common ones; actually, I don't know. OK. Did you face any DB schema inconsistencies, for example when you upgraded Neutron? I think from Havana to Kilo you have all the alembic scripts that exist for upgrading, but from Liberty on, the base level is Kilo. So when you upgraded from Juno to Liberty straight, did you see any database inconsistencies, like missing columns or those kinds of issues? When you skip the Kilo release, so when you upgrade directly from Juno to Liberty, we didn't hit this issue for Neutron. You didn't? We didn't. OK. Just to answer the question about the network topology: SUSE OpenStack Cloud is actually pretty flexible about that. I don't know exactly what kind of network topology you had in mind, but everything that is supported by SUSE OpenStack Cloud can be upgraded with what they described. So as long as it's supported; not sure what you meant by network topology, to be honest. In terms of backends, OK: we support OVS, Linux Bridge, VMware, Cisco, and so on, and we can upgrade all of that. OK, I think that's it. Thank you for your attention, and have a nice day.