Testing, one, two. OK, let's get started. So we're here to talk about automatic backup and recovery solutions for OpenStack, primarily focusing on TripleO, but a lot of the same concepts could be used for other OpenStack environments. My name's Dan, and this is Carlos. This is our first time presenting as well, so this is very exciting for the both of us. So let's get started. This is just a very rough agenda of what we're going to be talking about. To give a bit of an introduction to how Carlos and I met: it was about a year ago, I think September last year, that we met during a fast-forward upgrade meetup in Madrid, working on documenting, as part of the fast-forward upgrade process, a backup and restore process to be able to take a snapshot of an OSP 10 environment before upgrading it to OSP 13. Doing it manually, both the backup and the restore, can introduce several points of failure, and not only that, it's quite a time-consuming process as well. So ever since then, Carlos and I have been working out ways to automate the backup and restore process. And I'm going to hand over to Carlos now to talk a bit about our strategy and what we're actually trying to achieve. First of all, thank you all for being here today. Well, yeah, we actually met at a meetup about how to run a fast-forward upgrade to get an OSP 10 environment to OSP 13. And basically, the issue we were trying to solve there is: OK, we are exercising how to run the FFU, and we would usually break the environment, and we would say, ah, we need to redeploy again. We were investing a lot of time on something where, if we had been able to do a rollback, we could have started over and exercised the FFU again. So basically, the problem we were facing is that we were breaking the environment and we wanted to recover as soon as possible.
We wanted to document it, and we wanted to provide this knowledge we were creating to our customers. First of all, there are a lot of ways to run the actual backup and restore; we are exercising just one of these possibilities. So yeah. When we speak about backup and restore, we can speak about different categories. We could back up the user workloads, like backing up all our Nova tenants, but in this case we are not going to do that. We are going to back up our, let's call them, backend services, or the control plane services, to mitigate the risk of breaking our controllers. For example, say we are running a really, really sensitive operation in our cloud, which is running an upgrade; we break the environment, we break our cloud, and we want to roll back. So basically, what we do is take the things we need to restore our environment, which is basically config files and databases; those are the most important things we actually back up. The goal here, in short, is to be able to restore an undercloud and the overcloud controllers no matter what, and to do that automatically. Why do we try to make it automatic? Because the thing is, even if you are able to create an undercloud backup, for example by running a snapshot or making a dump of your databases, what if you don't have the undercloud anymore? Then you won't have an automated way of restoring it, because you don't have that undercloud node anymore, whether it was a virtual machine or a bare metal node. OK, so now Dan is going to speak about the strategies we use to back up all the individual services. So we're looking at a couple of key services here, and we're going to try to be as comprehensive as possible. We can't go too far into detail, mainly because time limits us, but we're looking at the main services.
We're talking mainly about database services, object store services, and the file system and configuration, so covering a lot of the core services. To start off with, looking at a MariaDB database in a non-HA environment: for example, when you're doing a backup and restore of the undercloud, it's just a standalone database, and backing it up is quite simple. It's just running the mysqldump command. And then to restore, it's just a matter of creating a new database, starting up MariaDB, increasing the maximum packet size because it's quite a large dump, and then restoring the data from the SQL file that you dumped. But for the overcloud, it gets a bit more complex, because you're dealing with several HA nodes working together. In this case, we're talking about a Galera cluster being managed by Pacemaker. So what you do is select an idle node: you identify a node that's not currently being used, or whichever is the main node being used at the moment in your OpenStack environment. Then you back up the database, back up the grants, and move those to a secure location. To restore, the process gets a bit complicated: you have to disable VIP access to the database completely, so that you get no incoming connections while you're doing the restore; stop Galera; temporarily disable replication, because what you're going to do is create a new database on each node and then synchronize them together; and also set database permissions on each node, for each MariaDB replica, for both the root and the clustercheck user. Then you synchronize the nodes, so bring them up and make sure they're all talking to each other and replicating; restart Galera; import the database and the grants; and then restore VIP access. We've actually got this documented upstream as well, on how to do this.
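For the standalone undercloud case, the steps described above might be sketched like this; the file path and the packet size are just illustrative, not the documented values:

```shell
# Back up the standalone undercloud database: all databases plus grants.
mysqldump --opt --all-databases > /root/undercloud-all-databases.sql

# Restore: raise the packet limit to accommodate the large dump,
# then replay the SQL file into the freshly started MariaDB.
mysql -u root -e "SET GLOBAL max_allowed_packet = 1073741824;"
mysql -u root < /root/undercloud-all-databases.sql
```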
For MongoDB: this is used for telemetry storage in OSP 10. Beyond that, telemetry storage has now moved to Gnocchi. But this is something that we had to cover for OSP 10, because that's based on Newton. And it's quite simple: it's just a matter of mongodump and mongorestore, and there are links to the documentation on how to do that as well. For Redis, it's used as a key-value store for services; for TripleO overclouds, we use it for some telemetry storage. It's quite simple and quite versatile in how you can back it up. It's just a matter of saving the current state, which it stores as a dump.rdb file, and then copying that file to a secure location. To do the restore, it's mainly just stopping the service, copying the dump.rdb file back to its original location, and then restarting the service. For Pacemaker, now, this is something that we've been testing in the last couple of weeks as part of the fast-forward upgrade process, and it's something that we've discovered: when you go from OSP 10 to OSP 13, the Pacemaker configuration changes, because originally OSP 10 is using systemd services and OSP 13 is using containerized services, and the configuration in Pacemaker changes to reflect that; a lot of the resources change. To do a dump of the configuration, it's just a matter of using the backup command, so pcs config backup and then the name of the archive that the command generates, in this case pacemaker_backup. That generates an archive, and then it's just a matter of stopping the cluster, restoring the config using pcs config restore and the name of the archive file, and then starting the cluster back up again. For Swift, now, Swift, as you know, stores users' object data as files, and usually you back this up as part of the whole file system backup. Normally they're stored under /srv/node.
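The Pacemaker part of that, as a sketch (the archive name is the one mentioned above; pcs appends the .tar.bz2 extension itself):

```shell
# Dump the full Pacemaker/Corosync configuration to a tarball.
pcs config backup pacemaker_backup

# Restore: stop the cluster on all nodes, replay the saved
# configuration, and bring the cluster back up.
pcs cluster stop --all
pcs config restore pacemaker_backup.tar.bz2
pcs cluster start --all
```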
Probably the key thing, and this is something that we learned early on when backing up Swift, is that you've got to include the extended attributes (xattrs) of the files; otherwise, you lose the metadata, and then the Swift data basically becomes useless. I've asked Swift developers whether there's any way to restore that metadata, and there doesn't really seem to be. So it's quite important, if you're ever creating an archive of the Swift data, to include the xattrs option, and the same if you're doing an rsync as well. Something that's vitally important. In terms of restoring, it's basically the same process, restoring back to the same location, and also backing up and restoring the ring files and the configuration for Swift as well. For the file system backup, what you want to do is back up any directories relevant to your OpenStack environment. This depends on your OpenStack environment itself and what you've got installed; everybody's OpenStack environment is different, with different services enabled. Normally you back up /etc, because that has your main configuration, and /var/lib for whatever services you are running, so Nova, Cinder, Glance, Heat. If you're using a containerized environment, for example Kolla, it's a good idea to keep the Kolla config backed up, so that you can retain it and restore it if you're running a containerized environment. As I mentioned before, /srv/node for Swift, so don't forget the xattrs there. Also the log files, and the root directory, because that usually contains the root credentials for accessing the database. Not always: I think for containerized Kolla environments that's stored within the actual Kolla config data directory. And also your cloud admin user's home: for the undercloud that's the stack user, or whatever user you use, and for the overcloud it's usually heat-admin.
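As a sketch of that file system backup, with the xattrs point applied; the exact set of paths is an example and will vary with which services you have enabled:

```shell
# Archive config, service state, logs, and Swift data.  --xattrs preserves
# the extended attributes Swift depends on; pass it when extracting too.
tar --xattrs --xattrs-include='*' --ignore-failed-read -czf /backup/fs-backup.tar.gz \
    /etc /var/lib/nova /var/lib/cinder /var/lib/glance /var/log /srv/node /root

# If you rsync instead of tar, -X carries the xattrs across.
rsync -aX /srv/node/ backup-host:/backup/srv-node/
```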
So now I'm going to switch back to Carlos, and he's going to talk about his work on performing undercloud backup and restore and automating that as well. OK, so what I'm going to speak about now is the status we currently have for running a backup of the undercloud, because we have it partially automated starting from Queens. So, as we said before, there is no single solution to actually do the backup and restore; there are many ways to do it. For example, for the specific use case of the undercloud: if you have your undercloud in a virtual machine, it will be really, really easy to create a snapshot of the virtual machine, and that's basically it. You don't need to do anything else; you just restore the snapshot and you should be OK again. If you have a bare metal node, it's different, because you might not be able to create a snapshot, since there is no virtual machine. Let's assume you don't have a way to back up or snapshot the volume on which the undercloud is installed. So we are taking the worst-case scenario: imagine that someone got into our data center and stole the undercloud node. They ran away and we don't have it anymore. So what we're going to do is reinstall the undercloud from scratch using the data we have from the backup. So we have it partially automated: starting from Queens, we have a new CLI option if you're using TripleO. You can actually execute openstack undercloud backup from your undercloud, and it will create the database dump, it will create the whole file system backup, and it will store these in a tarball, in a Swift container called undercloud-backups, with a timestamp of the date you created the backup. How does this work? What we did was create a new CLI option; this lives in python-tripleoclient. Then we created one Mistral workflow with all the tasks required to run the undercloud backup.
We are also doing a few checks, like: does your undercloud have enough space to store the backup you're creating? And, where possible, we check that the backup was created correctly. We implemented this using Mistral workflows because in the future we want to integrate it into our UI. So, theoretically, an operator would be able to create the backup from the UI directly, without needing to log into a terminal and run openstack undercloud backup. If you have an older version, what we did was document all the steps required, which are fairly simple. For example, we can execute openstack undercloud backup with options: by default, it will create the MySQL dump, and you can add or exclude paths. By default, we include the home directory of the stack user, but sometimes the operator might have a lot of things there that they don't want to back up, so they can exclude that directory. Here you have the options. By default, we back up the root folder, all the config files, and the Swift data, and sometimes we exclude the stack user's home directory. On an older version, what we do is basically run the mysqldump manually and create a tar file with all the files we need to back up. OK, so we have a few strategies here. What can we do to be sure we can actually restore our undercloud? We assume the worst-case scenario, so we either restore the snapshot or nuke the node and install from scratch. In this case, we assume we need to install from scratch, and the documentation is already available upstream; if you want to check it and push fixes, you are very, very welcome. The use case we are exercising here is a fast-forward upgrade, and sometimes it might not be easy to roll that FFU back, because the yum transaction history might be tricky to roll back.
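As a sketch, the invocation being described looks something like this; the paths are examples, and the exact flags available depend on your tripleoclient version, so check openstack undercloud backup --help:

```shell
# Back up the undercloud databases plus selected paths, while excluding
# the stack user's (possibly very large) home directory.
openstack undercloud backup \
    --add-path /etc \
    --add-path /var/lib/glance \
    --exclude-path /home/stack
```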
You might have package dependencies that end up breaking the node completely. So we assume the worst and reinstall from scratch. And this is really easy, because the undercloud right now doesn't have any HA, so it's very simple to install. How do we do it? Very easily: we restore the config files, we restore the certificate files, we restore the databases, and once we have those three things in place, we just run openstack undercloud install, and once it's finished we should have the undercloud back up and running. OK, so now I will hand it over to Dan, because we are going to actually exercise an overcloud backup and restore. OK, so the overcloud backup and restore, just to give a bit of context: the inspiration for it came when I was talking to one of our consultants, who I think is in the audience, Darren Sorrentino. There he is. I showed him the docs for our backup and restore process, and his response was: yeah, this is great, but why not turn it into an Ansible playbook? And I thought, yeah, why don't I? So I've got to give credit to Darren for spurring me down this path. The goal here is to create something composable and agnostic to automatically back up and restore an overcloud. Originally I developed a playbook to do this; it was a monolithic playbook, and it was suggested by some of the OpenStack devs to integrate it into the role that we've got here: ansible-role-openstack-operations is its location on the Git review on openstack.org. I've developed some foundational tasks within that role, which are currently under review. We're testing it to make sure that it fits all the use cases we're exploring, but it does a couple of things. For a start, it allows you to set an external backup server and automatically configure it. Originally I had it backing up to the undercloud; that doesn't suit most use cases.
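A minimal sketch of that undercloud restore sequence, assuming a tarball and database dump like the ones created earlier (the file names here are placeholders, not the generated ones):

```shell
# On the freshly provisioned node: put back config files, certificates,
# and the rest of the backed-up file system (keep the xattrs).
tar --xattrs -xzf undercloud-backup.tar.gz -C /

# Replay the database dump once MariaDB is available.
mysql -u root < /root/undercloud-all-databases.sql

# Re-run the installer; it reconciles the restored config and data.
openstack undercloud install
```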
Most people would have an external backup server that they'd be backing up their data to. Another thing it does is bootstrap node assignment: if you have a cluster of database nodes or a cluster of controller nodes, it selects one to perform certain operations, certain Pacemaker tasks, for example. A lot of the task files incorporate the Ansible synchronize module, which is an rsync wrapper, and it does delegation between the external backup server of your choice and the overcloud nodes. It also provides temporary SSH access to the nodes, which is required for that delegation. It can be kind of fiddly to get working; it took me a couple of tries to work out the delegation path properly. But it provides SSH access from the backup server to the nodes, and also disables that access after you've finished doing your backup. I've created some foundational tasks for the database backup and the database restore, only for containerized HA at the moment, so I'm testing it out on OSP 13, and also some tasks to validate the database. What I aim to do is include more services within this as well, to be able to back up Pacemaker, Redis, Swift, and everything else, and also to cover different backend architectures: your non-HA environments, your non-containerized environments, so that it can incorporate some legacy clouds as well. This is an example of a playbook that calls that role and pulls the tasks from it. As you can see here, it's mainly just five tasks that it's pulling in. So this is the backup playbook. The first task, initializing the backup host, makes sure that the backup host has an SSH certificate installed and that rsync is installed on it. Then it validates Galera just to make sure that it's all in sync, enables SSH, backs up MySQL, and then disables SSH.
The playbook for restoring the overcloud is similar. As you can see, we've got the openstack-operations role and the initialize-backup-host task; here you can see we set the bootstrap node, enable SSH, restore the database, disable SSH, and then validate the database to make sure that everything is in sync. Also worth mentioning is how I've pulled in the hosts here. In your inventory file, you'd usually have one host as your external backup host, and then you'd have the controller nodes. For the backup host, I'm using backup as the default host group, and for the controller nodes, or the database nodes, mysql is the default. I'm only running the backup against one node, because I only need to perform those tasks on one node, but for the restore I've got to incorporate all of them, because there's some Pacemaker setup and database initialization on all the controller nodes when you're restoring the database. And let's give a demo. Now, I just want to clarify, this is a live demo; anything can go wrong, so, okay. To show you what I'm working with here: I've got an overcloud deployed that contains five nodes, so three controllers and one compute, and I've also created a role for backup nodes, so I've got one backup node there, and that's the node I'm going to be backing up my database data to. This is all managed within TripleO, and it's basically just a generic node deployed; the only things I've really done are enable subscriptions on it and install rsync, which the playbooks do anyway. So there are no extra composable services, it's just a generic node. Now, I've got here a couple of scripts, and backup.yaml and restore.yaml are the playbooks that I'm using, which you saw before. So let's start by backing up the database.
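A hypothetical static inventory matching those default group names might be written like this; the host names and addresses are made up for illustration:

```shell
# Write a static inventory: one external backup host, plus the
# controller nodes that run the Galera/MySQL containers.
cat > inventory.ini <<'EOF'
[backup]
backup-node-0 ansible_host=192.168.24.50

[mysql]
overcloud-controller-0 ansible_host=192.168.24.10
overcloud-controller-1 ansible_host=192.168.24.11
overcloud-controller-2 ansible_host=192.168.24.12
EOF
```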
So as you can see, it's doing the backup server initialization, it's allowing temporary SSH access, now it's actually doing the backup and copying it to the backup server, and then removing the temporary SSH access. Just to show you the script that I ran: all it is is the ansible-playbook command. I'm using the dynamic inventory script that's included with TripleO, so that pulls the host groups in, and I haven't really changed much; it uses the default host groups as well. I've also set the backup host here; this is an extra variable to set which host group to use for delegation purposes. And I've got a backup directory: I've made this customizable so that you can specify, say, a date. So if you want to do daily backups, you could set your backup directory based on the date, and that way you can have a daily backup. The only other thing here is -M, which isn't actually used by the backup; it's used by the restore. I've included the Ansible modules directory, because that includes the Ansible Pacemaker modules that are installed as part of the undercloud. This is a requirement because the restore does Pacemaker management as well. Okay, so just to show you on the backup server: I've saved to a directory called backup-test, and all it is is a tar file containing the database and the grants, so the dumps from MySQL on the overcloud, taken from one of the nodes. Now, to simulate data corruption, I've got a script here called data-corruption, and basically what it does is add a couple of OpenStack users to the overcloud that are just gibberish, to simulate data corruption. Yep, so as you can see, we've got a whole bunch of randomized users there with randomized passwords. I don't want them there, so I'm going to do my restore.
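Pieced together from that description, the wrapper script is presumably something along these lines; the modules path and the variable names are assumptions, not the exact ones on screen:

```shell
#!/bin/bash
# Run the backup playbook against TripleO's dynamic inventory script.
# backup_host selects the host group used for delegation; the backup
# directory is date-stamped so a cron job yields daily backups.
ansible-playbook backup.yaml \
    -i /usr/bin/tripleo-ansible-inventory \
    -e "backup_host=backup" \
    -e "backup_directory=/backup/$(date +%F)" \
    -M /usr/share/ansible-pacemaker/modules   # needed by the restore playbook
```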
So, using that playbook, I've got a script here that calls it, and as you can see, it uses pretty much the same options as the backup script. So let's restore the database. It's doing the initialization of the backup host. Okay, it set the bootstrap status: as you can see, some tasks are only performed on one node, the bootstrap node; in this case, it's overcloud-controller-0. Now it's shutting down Galera. If I go to the controller node, you can see that it's starting to disable the MariaDB container. So it's demoting; HAProxy has reported that the database is offline. Yep, so it's stopped on two nodes. We'll jump back to the undercloud, and this should continue on. Sorry? Oh, yes. Okay, so now it's copying the corrupted MySQL database out of the way, and it's creating a container based on the same image that was running Galera. It's using that same container to create the database and initialize it. The reason I'm doing it through a containerized approach is mainly for user permission purposes, because the container uses different user permissions to what's actually on the host. Okay, so now that we've initialized the new database, we re-enable Galera. Yep, so you can see that it's starting on the nodes: it started on zero, it's starting on one, and starting on two. And now it'll promote the master for them, making sure they're all in sync. That's okay, that's because it's starting to sync. Yep, so the final node's promoting to master. Okay, it checks that the database is active, and now it starts importing the data. And I'll just check to see, I don't think I've got the MySQL config here, but if I go to the, oops, puppet-generated config, it doesn't seem to be there, that's okay. I was going to show the data actually being restored in MySQL. Usually this takes around a minute or two. Does anybody have any questions while this is active? This is active, sorry.
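The containerized initialization step might be sketched like this; the container name filter and the init command are assumptions based on the description above, not the role's actual tasks:

```shell
# Find the image the Galera bundle was using, then run a one-off container
# from it to re-create the data directory, so the files end up owned by
# the mysql user as seen inside the container, not a host UID.
IMAGE=$(sudo docker ps -a --filter name=galera-bundle --format '{{.Image}}' | head -1)
sudo docker run --rm \
    -v /var/lib/mysql:/var/lib/mysql \
    "$IMAGE" mysql_install_db --datadir=/var/lib/mysql --user=mysql
```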
Yep, feel free to come up to the mic and ask a question while this is restoring. Do you support any encryption in your scripts, like GPG configuration, so that the backup can be automatically encrypted? Not yet. Probably the only encryption we're using, because it's rsync, because it's SSH-based, is basically just plain SSH at the moment. But that's something that we can look at as well. Did anybody else have a question? I was wondering specifically, when you were talking about restoring the undercloud: let's say you have a physical undercloud and the hardware breaks, and then you have a blank server. Do you support setting it up from the backup? Because, yes, you said restore the database, but we don't have a database to restore to. I mean, that's what we're actually assuming. We assume that we don't have the physical node anymore to reuse, so we reinstall it from scratch. You just move the damaged node somewhere else, you plug in a new one, and you use the backup you have to reinstall from scratch. The thing is that we have everything: the config files, the NIC configurations, the database. So we have all the resources we need to restore it from scratch. Okay, I thought you were talking about a rollback, but you mean a restore, okay. Yeah, when we speak about the overcloud controller nodes, we actually roll them back, but with the undercloud it's different, because it's very simple to reinstall it from scratch, so that's what we are doing. Probably the only other thing to add is that the hardware configuration, or at least, for example, the network configuration, has to be the same as the corrupted undercloud's, so it has to match; otherwise, you can't communicate with your overcloud. That's probably the only proviso there.
So we'll take some more questions at the end, because this has just finished: as you can see, we've removed the temporary SSH access and it's finished the restore, and it's also checked that the Galera cluster is synced, and if I go to the overcloud, the data corruption is gone. So that shows a basic demo, kind of a proof of concept, using MySQL and Galera, mainly because that's probably the most complicated and time-consuming part. Normally a manual backup and restore of the database can take, say, an hour or two. With this, I timed it: it's 30 seconds for the backup and five minutes, thereabouts, for the restore. So it can be a pretty big time saver, and not only that, as I said before, you can have it automated on a daily basis and do daily snapshots of your database. Back to the slides. So, one more thing we should talk about: did you want to talk about user workloads, or should I? Okay, so in terms of user workloads: we touched on this previously, we're focusing on the control plane and the backend services. If you're looking for a user workload solution, there are a couple of different solutions out there at the moment; just to mention two here, one commercial and one open source: Trilio, which can do backups of your workloads and also configuration, and Freezer. Have you used Freezer before? Yeah, I mean, the use cases should be similar. So there are solutions for disaster recovery; what we're doing right now is providing a quick fix for when something goes wrong, but if you want to do all of your disaster recovery properly, you should probably use one of these tools. Also, Trilio can be installed using Director and acts as a Horizon plugin too, so it's quite easy to use to back up your workloads.
Some of the challenges that we've faced: testing has been a bit of a challenge, because there are always things that we can miss. For example, we had tested the undercloud backup, but we didn't back up including the extended attributes, and that caused issues. We'd tested it to the point where the undercloud reinstalled, and we tested some services, but when it came time to use something Swift-based, which the undercloud is, because it stores its plans, the tripleo-heat-templates, in Swift, we couldn't access them: there was no metadata, and Swift would quarantine the object for any file you tried to access. Excuse me. Another challenge is adapting the tasks for different versions and services. This is something that we found with the fast-forward upgrade from 10 to 13: it's quite a big change, because you've got some components that are deprecated, and you've also got different data formats. For example, with Redis, I think it uses one format in one version, and then the upgrade changes it, so that tends to get missed as well: when we tried to roll back, we were still using the latest version, and we had to roll back to the previous version too. And also maintaining the backup solution over new releases, because there are things that are impossible to foresee with any new project, any new service, or even an existing service. For example, the telemetry services changed dramatically from OSP 10 to 13, and who knows what could happen in the future. So we've got to try to anticipate what's coming and have a backup solution ready for it. And ideas?
Yeah, one very, very simple example: in OpenStack Queens, we don't have containers in the undercloud, and we are basing this backup solution on a non-containerized undercloud. Going forward, it will for sure break, because now we have containers in the undercloud, but we are not running the backup accordingly. So it takes a lot of work to maintain all the different releases, all the different versions, all the different possible scenarios that operators can have; that's really, really a big challenge. So we have a few ideas. For example, we want to create something we might call composable backups: in tripleo-heat-templates, we have a section where we define the upgrade tasks for each service, so we could use the same templates, per service, to define the backup tasks we want to run for each different service. We don't actually have an official place for the control plane backup workflows, so we should probably find an official place to make this information available to the whole community. We also need help from all the squads in the upstream community, because we are not experts in all the different services that we could back up, so it's something that requires a lot of collaboration between the different teams. And we don't have a CLI option to run the backups of the overcloud controllers, so if there are any collaborators who want to help with this effort, they are very much welcome. We could create, for example, a new CLI option called openstack overcloud backup --controllers, which would back up all the information needed to restore the controllers if something goes wrong. Another thing that we don't have is the TripleO UI integration: we could call the undercloud backup from the UI, because it's in a Mistral workflow.
That's basically the reason we encapsulated the undercloud backup in a Mistral workflow: to be able to call it from the UI. But that integration doesn't exist yet, so it would be nice if people want to help with this. And I think that's all. Thank you for coming, and if you have any questions, we are here for you, either here or outside in the hall. Thank you.