All right, so I'm going to get started. Bear with me: since I don't have a clicker, I'm going to have to look at my screen a little bit here and there. Good afternoon, everyone. My name is Roger Lopez. Today I'm going to be giving you a presentation on backup and recovery of OpenStack, specifically your overcloud, and talking a little bit about how to back up and recover from a failed update. To give you an agenda of what I'm going to be covering today: initially, we're going to cover a little bit about the OpenStack environment that I used to do the backup and recovery. We're going to talk about the key differences between updates and upgrades; a lot of times these words get interchanged, and for this presentation, due to time constraints, we're going to be focusing specifically on updates. We're going to talk about backing up the OpenStack environment with regard to the control plane, so that includes your OpenStack databases (your MongoDB database, your MariaDB database, your Redis database) and, of course, all the different OpenStack configuration files. And then finally, how to restore those different OpenStack services on your overcloud. And then the bonus, which I'm not going to be covering, but I added it to the slide deck: if you're running Red Hat OpenStack Platform, how do you back up and restore your undercloud environment?
So to give you a breakout of what my OpenStack environment consisted of: I had three controller nodes with high availability, four compute nodes, and four Ceph nodes. Each node had four NICs, and each NIC had its own network associated with it, with a VLAN. If you look at the controller node as an example, em1 was connected to my external network, em2 was connected to my provisioning network, p3p1 was broken up into two VLANs, one for my tenant network and one for my storage network, and finally p3p2 was for my storage management network. That just gives you a breakout of how the networking was distributed across the different nodes of the OpenStack environment that I had. So, key differences between updates and upgrades. As I mentioned earlier, a lot of times we interchange these words. With regard to updates, I'm referring to in-place updates of a particular OpenStack version. In the case of Red Hat, that would be Red Hat OpenStack Platform 8; if you're doing upstream stuff, where it's Liberty, it would be minor updates of that version. With regard to upgrades, I would be referring to Red Hat OpenStack Platform 8 to, let's say, Red Hat OpenStack Platform 9, or, let's say, Liberty to Mitaka. In this presentation I'm only going to be covering updates. One of the reasons that I'm not covering upgrades is time, and another is that the whole concept of how to back up and restore is similar whether it's updates or upgrades, but it's also a little more complicated with upgrades, for the simple fact that as services change or get deprecated and you move on to major releases, that can complicate things, for example Ceilometer moving on to Gnocchi.
That's something where rolling back becomes a little more of an issue, and something you have to look at. But since we're focusing on updates, in place within a specific version, I'm going to show you a breakout of how to do that. So, backing up your OpenStack environment. One of the things I wanted to do was make a timeline, right? There are a lot of steps involved in backing up, and the last thing you want to do is just jump in and try to figure out what they are. So this is a breakout in five steps of how you would back up your overcloud. I broke it up into backing up your MariaDB, backing up your MariaDB users and their different permissions, backing up your MongoDB database, which holds your Ceilometer information in the case of Red Hat OpenStack 8, backing up your Redis database, and then of course backing up all the different OpenStack configuration files, and then storing all of that in some remote location. The last thing you want is to back up all this stuff and leave it somewhere where you're going to lose it if you actually have a catastrophic failure, right? So, going into the first step, backing up the Galera cluster, MariaDB. I broke it out into four steps: A, B, C, and D. The first one: access a controller node as your root user. Then identify which node is being targeted by HAProxy. The reason we want to find this out is that, if you have multiple controllers, you want to make sure you're using a controller that's idle to do the backup, rather than a controller node that's already doing work in the background; you might as well use one that's not. And then ensuring all your Galera nodes are synced up in the cluster, right?
You don't want to take a backup of one controller node and then realize your database is not synced across all your different controller nodes. And, of course, taking that backup, which I mentioned a little earlier. So what do these steps look like? I put A and B together there, just to show that they're part of the same process. Step A was accessing your controller node. So here I'm SSHing in as the heat-admin user and then using sudo for root access. With this grep command here, looking at the haproxy.cfg file, what I'm looking for is which overcloud controller node is being targeted by HAProxy. When I do this, I can see that the 172.16.1.11 IP address is being targeted by HAProxy. So what I need to find out is which controller node is being accessed through this IP address. I can use the mysql command shown here, which tells you which particular controller node is answering on that host. Where I put the user here, you can use any OpenStack user, so a heat user, a glance user, whichever user, as long as you have the password and can access the database; or you can go ahead and create your own database user if you want to. I left it open so that you have options. So once I have the hostname, I know which controller is going to be accessed. What I want to do from that controller-0 node is make sure that my other controller nodes are synced up with the database. In this case you can do a curl against your overcloud controller-1 and controller-2, in my example, on port 9200, and here I'm just trying to verify that the database is synced across the Galera cluster, right?
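The sync check described here can be sketched in Python. This is only a minimal sketch, not the curl command from the slides. It assumes the clustercheck endpoint on port 9200 answers HTTP 200 with a body saying the node is synced, and a non-200 status otherwise. The function names `is_synced`, `check_node`, and `unsynced_nodes` are hypothetical, made up for illustration.

```python
import urllib.error
import urllib.request


def is_synced(status, body):
    """Interpret one clustercheck reply (assumed: HTTP 200 plus a 'synced' message)."""
    text = body.lower()
    return status == 200 and "synced" in text and "not synced" not in text


def check_node(host, port=9200, timeout=5.0):
    """Query one controller's clustercheck port and report whether it is synced."""
    url = "http://{}:{}/".format(host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_synced(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # A non-2xx reply (for example 503) still carries a body we can inspect.
        return is_synced(err.code, err.read().decode("utf-8", "replace"))
    except OSError:
        # Connection refused, timeout, or name lookup failure: treat as not synced.
        return False


def unsynced_nodes(hosts):
    """Return the hosts that did not report a synced state."""
    return [host for host in hosts if not check_node(host)]
```

With hostnames like the ones in the talk, `unsynced_nodes(["overcloud-controller-1", "overcloud-controller-2"])` would be expected to return an empty list before the dump is taken. Anything in the list is a node to investigate first.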
Once I know this, since I don't want to make a backup on my overcloud controller-0, because that's the one HAProxy is using, I'm going to use either controller-1 or controller-2. The option is yours; there's no right or wrong answer there. And then finally, you want to use a SQL statement to back up your OpenStack database. What this mysql statement is really doing is selecting the distinct table schemas from the information schema tables, making sure that the storage engine is InnoDB, and that you're not taking a backup of the mysql schema itself. The reason for that is that you're going to recreate it during the restore process, because you're essentially going to have a brand-new one. Then we're passing all of that to the xargs command, which reads the standard input stream and converts each line into space-separated arguments on the command line; in this case I'm passing them to mysqldump. So it's going to take every table schema within your MariaDB and dump it out, and with that it's going to create this long openstack_database timestamp.sql file, which you're going to replay when you want to restore your MariaDB. That covers the piece of backing up your MariaDB database. But with that, you don't have the users and their permissions that are associated with that MariaDB. So once again, you need another mysql statement. This one is just using the CONCAT function in MySQL, and it's selecting all the users on the database whose name has a length greater than zero. Once again, I pass it to xargs, and with some sed manipulation you get this grants timestamp.sql file. And here, below it, is an example of what that grants timestamp.sql file looks
like. In here, the first line is GRANT USAGE ON *.* TO ceilometer IDENTIFIED BY PASSWORD. That's really just creating, in this example, the ceilometer user and granting it access so that it can log in to the database, but that's all it does. The next line is the important one, which is the GRANT ALL PRIVILEGES on the ceilometer database to the ceilometer user. And through that mysql query that we're running, which writes the grants timestamp.sql file, it's going to collect all the different users in your MariaDB database and capture what you're seeing here. In this example I have ceilometer, glance, and heat, but you would see all the different OpenStack services that you have. So now that you have the MariaDB and you've got your users and permissions, the next step is backing up your MongoDB database. In Red Hat OpenStack 8, MongoDB is storing a lot of the telemetry, Ceilometer, information. One of the things to keep in mind for your MongoDB database is that you want to make the backup on your primary node, and you want to use mongodump to do that. And of course you want to verify that the backup that was created is actually there. So these are the steps to do it. The first thing you want to do is log in to your MongoDB database, and in order to do that I need to find, within my mongod.conf file, the IP address used to connect to my MongoDB database. The bind_ip is what you're looking for, and in this case I know 172.16.1.17 is the host to connect to for my Mongo database. The next command is what you use to connect to your Mongo database. You can do this from any controller node, and when you log in, what you're hoping to see is something like tripleo underscore
primary. What this is telling you is that this is the primary node for my Mongo database, and this is where you want to make the backup. However, if you see something like tripleo underscore secondary, this is just another member within the cluster for your Mongo database, and you don't want to make the backup here. So what you want to do is find out, from within that controller, which host is the primary. If you run rs.status(), where rs stands for replica set, it'll give you a list of all the members that are part of the Mongo replica set. In this example I give you a small snippet showing which one is the primary. You'll get all this information, and if you notice, name is the field giving you the IP address and the port, which is what we're looking for, and the state is telling me that this is the one that's primary. Before, I was trying to connect to .17, and now I realize that .15 is my primary, so this is the node I want to use for my backup. So how do I do the actual backup? You want to log in to the controller node that holds the primary for your Mongo database. You create a directory where you want to store your backup, and then you use the mongodump command to do the backup. So mongodump with the oplog option, which is your operations log, a special collection that keeps a rolling record of all the operations that modify the data; then you pass your host and your port; and finally the --out option is where you want your backup stored, which in this case would be the directory that you created previously. And then, when you do that mongodump, depending on what flavor of OpenStack you're running, in the case of Red Hat OpenStack 8 you'll have two databases.
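The primary lookup and the dump command just described can be sketched as follows. This is a hedged illustration, not the commands from the slides. It assumes you already have the `members` list from `rs.status()` as Python dictionaries, with the `name` and `stateStr` fields that output carries. The helper names `find_primary` and `mongodump_command` and the backup directory are hypothetical.

```python
def find_primary(members):
    """Return the 'name' (host:port) of the PRIMARY replica set member, or None."""
    for member in members:
        if member.get("stateStr") == "PRIMARY":
            return member.get("name")
    return None


def mongodump_command(primary, out_dir):
    """Build a mongodump argument list; 'primary' is assumed to be 'host:port'."""
    host, _, port = primary.rpartition(":")
    return ["mongodump", "--oplog", "--host", host, "--port", port, "--out", out_dir]


# Example data shaped like the snippet described in the talk (.17 secondary, .15 primary).
members = [
    {"name": "172.16.1.17:27017", "stateStr": "SECONDARY"},
    {"name": "172.16.1.15:27017", "stateStr": "PRIMARY"},
]
primary = find_primary(members)
command = mongodump_command(primary, "/var/tmp/mongo_backup")  # hypothetical directory
```

Building the argument list separately from running it makes it easy to print and review the command before handing it to something like `subprocess.run` on the primary's controller node.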
You'll have the admin database and you'll have a ceilometer database, so within that location you want to make sure you have those two databases stored. After the Mongo backup, you have the last database that you need to back up, which is the Redis database. In Red Hat OpenStack 8, Redis also stores additional Ceilometer information, and it's mostly used by, if I'm not mistaken, the Tooz library, for coordination and locking. In order to do a backup of the Redis database, you want to make sure that it's up and running. There's a command within the Redis client called BGSAVE, which stands for background save. Then you also want to confirm that the database has been backed up, so there's a command called LASTSAVE, and then of course you check the timestamp of when the backup was done. So let me show you what that looks like. Once you log in, the redis.conf file has some information that you're looking for. What we're seeing here is that you want to get the bind IP address, and masterauth is the password. Using the redis-cli command, -a passes the password that's provided in the redis.conf file, -h passes the IP address where the database is located, and then you pass the PING command, which gives you a response of PONG. If it gives you a response of PONG, you know that your Redis database is up and running. Then you take the same redis-cli command and use the BGSAVE command, which starts a background process to save your database. But you want to make sure that your database was backed up and has finished backing up, so you use the same redis-cli command, but now with LASTSAVE at the end of it, and this gives you an integer value, in this case
like 147 and so forth, and this is just a Unix timestamp. If you want to figure out what that is in human-readable form, you can just use date -d @ with that integer value, and it'll give you the timestamp of when the last backup occurred for the Redis database. This way you know that what you're going to back up for Redis is what you're expecting. And the last part of backing up your OpenStack is all the configuration files. Depending on what flavor you're on, there are a lot of different services that you need to back up, right? So I give you a manual process of what you need to back up: all of /etc/<service>, /var/log/<service>, /var/lib/<service>, your /etc/sysconfig/memcached, and your /srv/node, and then you just tar all that up and store it somewhere remote. Option two that I put out there is under that bit.ly link: I have a script that I created that will go off to every overcloud node in your environment and make a backup of all the different services. Then when you go to restore, you're putting it back for every single overcloud node that you have in your environment. So that describes, in a nutshell, what you need to do for backup. But before I get into the restore process: everyone here, I'm sure, has run into some sort of update failure, right? The reality is that you don't necessarily need to do a rollback of your environment because you had a failure. A lot of times you'll be able to go in, figure out what the particular error was, and rerun your commands to redo the update or continue the update. If you ever have an issue where you cannot do that and you do need to roll back, that is what these steps are for. So for the restore of the OpenStack environment, I broke it up into the same type of timeline, right? You've got fewer steps now.
I'm giving you four instead of five this time around. Basically it comes down to this. Step one, you're going to want to roll back any of your yum transactions. When you're doing an overcloud update, the first thing it does is go into every overcloud node and update all the packages to the newer version, right? So we want to roll back all those transactions. We then want to restore the MariaDB cluster, restore the MongoDB database, and then restore the OpenStack configuration files in that tar.gz file, which includes the Redis database, because the Redis database is stored under /var/lib/redis, so you will have that dump file. So step one, the rollback. That's where yum history is your best friend. You can use the OpenStack database timestamp as a reference for when the update happened, right? When we backed up to openstack_database timestamp.sql, we had an idea of when we did the backup. Assuming we did the update right after that, you can use it as a timestamp if you need to go and check. In this example where I ran yum history, it shows, you know, the System user, ID 16, and I had about 77 different packages that were altered on this particular day and time, and the action was either an install or an update of a package. This is where you can do a yum history undo of your ID. So I did a yum history undo of 16, and this rolled back all the packages of my OpenStack environment on that overcloud node. You have to do this for every overcloud node in your environment, and ideally you would do it one by one, to make sure you don't have any issues with a particular rollback. This will get you back to the point in time you were at when you did your backup. So the restore process for the Galera cluster is kind of long, actually, and I put all the steps out here. But instead of describing all these steps through
bullet points, I'm going to describe them through the actual commands that I ran. The first thing I wanted to do to restore the MariaDB was to go into one of my controller nodes and get in as the root user again. And once again, I wanted to figure out which controller node was being targeted by HAProxy, right? In this case, as we knew earlier, it's 172.16.1.11. Then I wanted to disable all my OpenStack services other than Galera. This will vary depending on which OpenStack version you're running, but in the case of Red Hat OpenStack 8, when you disable openstack-keystone you actually disable a lot of the OpenStack services that depend on it. Once I've done that, the following pcs resource disable commands cover all the other services that are not necessarily dependent on openstack-keystone. So in a nutshell, the important thing here is to make sure you stop all OpenStack services but leave Galera up and running. With all the services stopped, you want to confirm that with pcs status. The other thing I want to do is drop any connections to my database, right? Since I'm going to do a restore, I don't want incoming connections at the same time, possibly modifying my database. You can simply drop connections using an iptables rule. And then finally, in this step here, you want to take Galera out of Pacemaker's control, in this version at least. The reason is that if I shut down my database, which you're going to have to do manually in order to restore it, Pacemaker would see that the database is down and automatically try to start it back up. So in order not to have Pacemaker restart the database when I don't want it to, I'm going to make that resource unmanaged. And then finally, on your compute
nodes, you want to stop any compute services. In RHEL OSP 8, that's the Ceilometer compute service and the Nova compute service. Now that the Galera cluster is not managed, I want to go in to each controller and shut down the MariaDB database. And this is the part where I'm going to recreate my MariaDB with a whole new MySQL data directory. In this snippet here, I'm moving my old MariaDB data directory aside, creating a brand-new /var/lib/mysql, setting the correct ownership and permissions, using mysql_install_db to install a new data directory under the mysql user, and then setting the correct ownership and restoring the SELinux settings for it. Once I've done this, I have pretty much a clean MariaDB database that has no information other than the mysql schema. On each controller node you're going to log in to MySQL and create a user that we call clustercheck. I'm granting all permissions to clustercheck on localhost, and I'm not defining a password for it; you can if you want to. Once I get out of that, I want to shut down that particular database. Now that I've created all this and it has the user I'm looking for, I want to go back to having Galera managed, because now I actually want my MariaDB to be up and running. So: pcs resource manage galera, pcs resource cleanup galera. The goal here is that after the cleanup, when you do a pcs status and grep for galera, the Masters line shows all your different controllers. You'll see this Masters array with overcloud controller-0, overcloud controller-1, and overcloud controller-2; that's what you're looking for. If by chance you see two of them in there and you don't see the third one in
there, you probably want to try to clean up your environment again, and it takes a little time for it to happen. It doesn't happen right away. So when you do this step, realize that you probably have to wait, sometimes a couple of minutes, for it to actually put all the different controllers back into the Galera cluster. Once all three of your controllers, or however many controllers you have, are back in your Galera cluster and you have your MariaDB database in a clean state, the last thing you want to do is load your openstack_database timestamp.sql file back into your MariaDB. In this case, using the mysql command, I can simply run mysql -u root and pass in the SQL file. The next one passes in all those users and permissions that I captured when doing my backup: mysql -u root, less-than sign, the grants timestamp.sql file. I want to test that clustercheck is working as it should on each controller node. I can do that locally with clustercheck, and then, since it's served through xinetd, I can use the curl command, and this will give you output that says the Galera cluster node is synced or it's not synced. The goal is to make sure that when you run this curl command against all your controllers, everything is synced up as it should be after you've done the restore of your OpenStack database. And last but not least for the MariaDB, you want to enable database connections back into your database. So you want to remove the iptables INPUT rule that was dropping database connections through port 3306. So MariaDB is up and running, and now you need your Mongo database, your Redis database, and the config files, right? So, the Mongo database. Things to keep in mind: you always want to use the primary node, so you want to log in to your primary node.
You want to make sure that you're not taking in any write requests, so there's an eval command that stops write requests from being taken in while you're doing the import. You want to ensure that you have your backups for the different databases associated with your Mongo database; in the world of Red Hat OpenStack 8, that's the ceilometer database and the admin database. You want to drop any existing databases, if you had issues with them or you're doing this whole restore process. Then you want to use the mongorestore command to do the restore, and finally the eval command goes back to a true setting to take in write requests again. So what do those commands look like? Log in to a controller. Before, we had just the Galera cluster service up and running, but now, since we're trying to restore the Mongo database, we're also going to have the mongod service up and running. So: pcs resource enable mongod. I'm then going to log in to my primary node, so the mongo host is the bind IP, port 27017, and we have that primary node information from when we were doing the backup. Using the mongo command, I'm connecting to my host and its port with --eval. This eval command, sh.setBalancerState(false), is what's going to ensure that no writes are happening to the Mongo database while I import my databases into Mongo. So here, connect to your Mongo database. You are connected to the tripleo colon primary, right?
Make sure that you're on the primary, not the secondary. Here I'm going to switch into the admin database with use admin, and then I'm going to drop that database; you drop a database using the db.dropDatabase() function. Once I've dropped the admin database that was living there, I also want to drop the ceilometer database. Once I drop that, if I run show dbs, you can see that your databases have been dropped from your Mongo replica set. In order to restore them you want to use mongorestore, and mongorestore is a pretty simple command to use: mongorestore, your host, your port, then -d is the name of your database, so in this case I had the admin database and the ceilometer database, and then you pass the location of your Mongo backup. Hopefully you backed this up to some remote location, and you either have access to that remote location from your overcloud or you copied it from the remote location onto your overcloud node. Finally, for the Mongo database, the last step is to reset that eval command from a false state back to a true state, and this allows you to once again take write requests on the MongoDB cluster. With the Mongo database up and running, one of the last steps is restoring your OpenStack config files, so all the different services that your OpenStack environment has. You then want to start up all the OpenStack services that weren't started, because initially you only had Galera and Mongo up and running. Once everything is up and running, you want to go to your compute nodes and start all the services that you had disabled there. To do that, you have the location of your backup, which is a tar file. Here I copied it to my / directory, and then I'm untarring it. Now, remember, when you
untar it, you're overwriting anything you had existing, so keep that in mind. And then, with RHEL OSP 8, enabling the openstack-keystone service is going to enable most of my OpenStack services, and then the pcs resource enable on the OpenStack Ceilometer resources and so forth is going to enable everything else that openstack-keystone didn't re-enable. The same applies for different versions of OpenStack: you want to get all the different services back up and running. Once all your controller nodes are up and all the services are up, you want to start all the compute services that you stopped; in this case it was the openstack-ceilometer-compute service and the openstack-nova-compute service. And that's pretty much my presentation. After this slide, and I'm hoping you guys can get a copy of the deck, I put these bonus slides on here. If you're running Red Hat OpenStack Platform and you want to know how to back up and restore the undercloud, I've given you similar slides to what you saw for the overcloud. And that's my presentation. Thanks for coming; I appreciate it. If you have any questions, I will try to duck and dodge as much as possible, but feel free to go to the mic and we'll go from there. Yeah, would you mind going to the mic so that we can get it on the recording? Thank you. Sorry. Yeah, I'm [inaudible], I work with Cisco. My question was: how does this all relate to automation? Because if we have a big event, what we just do is rerun the automation, and except for the data part, which I agree with, the configuration, everything, is just restored by rerunning the previous version of whatever you had in automation. So, I mean, the way we were looking at this is that you're doing a rollback of your environment, right?
So, the way I understand it, if you're doing this with automation, is it like a brand-new environment, or...? I mean, if it's virtualized, right, you either blow away the VM, or, if it's config-based, you can just rerun it, right? If it's idempotent, it will just roll back your configuration. Yeah, I think the biggest factor for the backups is, well, you're backing up the databases, right? And there might be some changes within the configuration files themselves, so recreating them might not give you everything that you're expecting. So I think the safest way, at least, would be to make sure you have a backup of all the different databases that you need, and to have all the configuration files stored somewhere else. As far as automating it, I think it's just a matter of automating that process to then push it back to the different overcloud nodes. Could you give an example, though, of configuration that wouldn't sit in your configuration management system? No, not off the top of my head, to be honest, but my concern would be that there might be a change that gets overlooked. Well, then you have a bigger problem in your change management, right? Right, right. Yeah. I mean, I don't have a good answer for you on that one. I'm sorry. No, no worries. OK. Thank you. Yeah, when you update even a minor version, the Python libraries are sometimes updated. Restoring the configuration files won't necessarily restore the Python libraries. Don't you need to back up the versions of the Python libraries?
Well, at least when I was doing the update, with the rollback, if there are any changes happening, I would assume the yum rollback, the undo, would put everything back to its original version. When you do the update, it goes into the overcloud node, updates all the different packages that it needs to, and then goes off and runs the update process, right? At least in the version I ran, when I did the yum undo, it put all the packages back to their original state, where I had a working version of, I guess, Python or any other packages. So I think that's covered by the yum rollback. Thank you. Thank you, appreciate it. Any other questions? All right. Well, thank you so much, everybody. I appreciate you coming, and have a great OpenStack.