Thank you all for coming to my session. My name is Maor Lipchuk. I am a senior software engineer at Red Hat Israel, and I work in the oVirt storage team. Today I want to present to you what we did in oVirt 4.2 and how we support a disaster recovery solution using Ansible in an automated way.

So, the first question that crossed my mind when I was preparing this presentation: can we really avoid a disaster? I know it's probably a stupid question, but basically every company, no matter how big it is, no matter how much it spends on avoiding a disaster, encounters a disaster one way or another. So the main strength of a company is to recover from it quickly and efficiently, with almost no loss of data.

Saying that, let's talk a bit about oVirt. I see most of you know oVirt. It's a virtualization management platform that helps the admin manage VMs, templates, and hypervisors, hosts. Let's talk about the evolution of backup and disaster recovery in oVirt, and how we got to what we have in oVirt 4.2.

At the beginning there were a few solutions for backup and recovery. I think the most used one was the export storage domain. It's basically a file storage domain that was used to back up VMs that the admin saw as important. It was a storage domain that was attached to the data center. A VM that was precious to the admin was copied to the export domain, its disks and its OVF. And once he wanted it back, after a disaster or when he wanted to use it again, he simply copied it back to the setup. So this is great as a backup solution, but it's not really efficient for backing up your entire setup. You can't really fit your entire setup into one storage domain, and copy operations are pretty heavy, which makes it not really suitable for recovery.
So, in oVirt 3.5 we introduced something a bit newer that can provide us a solution for disaster recovery, which is basically two features: the OVF store disk and the import of storage domains. With those two features, what we did is take the storage domain, and on each storage domain we have the disks, which are the VMs' disks, the metadata of the storage domain, and a new special disk which is called the OVF store disk. What the OVF store disk contains is a representation of each VM as an OVF. This is an example of an OVF. It's the Open Virtualization Format, an XML that represents the VM and the metadata of the VM. With that OVF, together with the disks which reside on the storage domain, we can basically recover all the VMs and the templates of the setup. So the storage domain somehow encapsulates the data center: you can take only the storage domains, with the OVF stores residing on them, and you basically have all the information needed to recover an entire setup.

So, the improvements: the recovery process is pretty fast, because now the OVF store contains all the OVFs and no copy operations need to be done, and all the VMs and the templates in the setup are basically recoverable. You don't have to back up your entire data center; it's all there, the OVF store does it for you. But still, it's not a fully fledged solution. It's not an end-to-end solution. The admin still needs to orchestrate the recovery process: he needs to import all the storage domains and register the VMs and the templates. Mistakes can be made along the way, failures can happen, and if you want to recover as fast as you can, it's still not there. Second of all, the OVF is a great way to know all the data of the VM, but there were still gaps, still missing parts, like for example permissions, and the name of the cluster the VM was in before.
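Since the OVF slide itself isn't reproduced here, this is a minimal, illustrative sketch of what such an OVF entry for one VM could look like. The envelope structure follows the DMTF OVF schema in spirit; the exact elements oVirt writes into the OVF store are richer and version-specific, so treat this purely as an example of the format, not as oVirt's actual output:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative OVF entry for a single VM; real oVirt OVF store
     entries contain many more elements (memory, CPU topology, NICs,
     snapshots, and so on). -->
<ovf:Envelope xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <References>
    <!-- Points at the VM's disk image on the storage domain. -->
    <File ovf:id="disk1" ovf:href="disk1.img"/>
  </References>
  <Content ovf:id="myvm" xsi:type="ovf:VirtualSystem_Type">
    <Name>myvm</Name>
    <Description>Example VM metadata stored in the OVF store disk</Description>
  </Content>
</ovf:Envelope>
```

The point of the talk is exactly that this XML, living on the storage domain next to the disks, is enough to re-register the VM on another setup.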
So even when the admin tried to recover the setup, he still needed to fill in all those gaps. So it was still not enough. Saying that, let's see what we did in oVirt 4.2.

First of all, we tried to fill in the gaps. There were a few things missing from the OVF, as I said before. I won't go over each one of them, but we tried to see what was missing and what we could add there; as I said before, permissions and the cluster name. So for example, we added the cluster name. Before that, when the admin wanted to register a VM, he needed to explicitly indicate the cluster that the VM should be registered to. Now, if he knows and can guarantee that the cluster is already there, it happens automatically; he doesn't need to do anything. So it makes his life a bit easier.

Second of all, when we talk about disaster recovery, what we want is to get the destination setup after a disaster up and running as fast as we can, and in the same state as it was before. For that, we also need to preserve the VM statuses. We probably had VMs running before that were very important to the admin, and we need a notion of which VMs were running. That type of information is not typical for the OVF; the OVF is meant to contain the metadata. So what we did is add, in the OVF store disk, in this special disk, a new file that holds some kind of extra data, and part of it is the VM statuses. So every 60 minutes, or whatever interval you configure, all the data in the data center is preserved in the OVF store disk, and you can know which VMs were running at the time the data was backed up. So basically the admin can now know, after recovery, which VMs were up and running, and decide whether to run them or not.
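For reference, that update interval is an engine configuration value. Assuming the option name used by recent oVirt releases, `OvfUpdateIntervalInMinutes` (check `engine-config --list` on your version before relying on it), inspecting and lowering it would look roughly like this, run on the engine machine:

```shell
# Show the current OVF update interval (default: 60 minutes).
# Option name assumed from recent oVirt releases; verify with
# `engine-config --list` on your installation.
engine-config -g OvfUpdateIntervalInMinutes

# Lower it to 15 minutes, then restart the engine to apply.
engine-config -s OvfUpdateIntervalInMinutes=15
systemctl restart ovirt-engine
```

As the Q&A at the end of the talk notes, a shorter interval narrows the recovery gap but means more frequent uploads of the OVF data to the storage domains.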
The third thing we did in oVirt 4.2 is not really a new feature; it's some kind of a defined way for the admin to make his environment ready for recovery. We defined a way to support the recovery process using a separate oVirt setup. Now I'm going to show you what the prerequisites are for the admin to get his setup ready for a disaster.

First of all, storage replication. As I mentioned before, storage domains with OVF store disks basically encapsulate all that is in oVirt, mostly the VM part, but that is the most important part for the admin. So the admin needs to set up replication from his storage domains to another site, so that in case of a recovery those replicas can be imported on the other site.

Second of all, prepare a standby, just-in-case setup. The admin needs to have a secondary setup up and running, because we want the recovery process to be as fast as we can. It will have its clusters and running hosts, and the replicated storage domains will not be part of it yet. In case of a disaster, the idea is to import all the replicated storage domains into the secondary setup, attach them, register all the VMs and the templates, and run all the VMs which were running in the original site. That way we gain a recovery that is much better than it was before. I know the third part is marked in red, and I will get to it in a few slides.

So after all, this is how it will be done: the primary site will basically be disabled, and the secondary site will be active with the attached storage domains. As I mentioned before, there is the registration of the VMs and the templates, but there is a problem here. The problem is that the OVF was updated on the primary site. The primary site has different clusters, different permissions, different networks, and the user needs to make some changes in the OVF data for the VMs to be registered in the secondary setup.
So he needs to choose which cluster to use when registering the VM, for example an alternative cluster, or a different affinity group, or different permissions. For that, he needs to prepare a mapping file. What the mapping file basically does is tell the register operation: when you get a VM whose OVF defines a permission or a cluster whose name is A, in the secondary setup it should be B. Using this mapping file with the register operation, this is done automatically. That's basically what needs to be done.

But writing these mapping files could be exhausting. The user needs to go over the entire primary setup, take all the cluster names, take all the users and the roles and the networks, write them down, and then map them to the secondary setup. This could be a long and exhausting operation for him. And that leads me to the solution with Ansible.

What we also introduced in oVirt 4.2 was a new role, which is called oVirt disaster recovery. You can look it up on Ansible Galaxy; it's a role which is part of the oVirt Ansible repository, which basically covers various parts of the oVirt infrastructure that can be managed with Ansible in an automatic way. What the oVirt disaster recovery role supports today is, first, the generate-mapping operation, which generates the mapping based on the primary site and gives you a kind of template in which only the secondary, target properties need to be configured. It also supports failover, which in case of a disaster does the recovery automatically, as an end-to-end solution, and failback, for when the primary setup is recovered and its hosts are up again: the user will probably want to go back to the primary site, because it has more resources, more compute resources, better hosts, better storage domains. So let's talk a bit about what the mapping generator does.
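To make the mapping file idea concrete, here is a trimmed, illustrative sketch of such a var file. The `dr_*` variable names follow the pattern of the template the ovirt-ansible-disaster-recovery role generates, but treat the exact field names and all values here as assumptions to check against the template generated in your own setup:

```yaml
# Illustrative disaster-recovery mapping var file (trimmed).
# The "primary" entries are filled in by the mapping generator;
# the admin supplies the matching "secondary" values.
dr_sites_primary_url: https://primary-engine.example.com/ovirt-engine/api
dr_sites_primary_username: admin@internal
dr_sites_secondary_url: https://secondary-engine.example.com/ovirt-engine/api
dr_sites_secondary_username: admin@internal

dr_cluster_mappings:
  - primary_name: production-cluster   # cluster name on the primary site
    secondary_name: dr-cluster         # where those VMs register on failover

dr_role_mappings:
  - primary_name: UserRole             # as in the demo later in the talk
    secondary_name: ClusterAdmin

dr_network_mappings:
  - primary_network_name: ovirtmgmt
    secondary_network_name: ovirtmgmt
```

During register, each entity named in a VM's OVF is looked up in these lists and replaced with its secondary-site counterpart, which is exactly the A-to-B substitution described above.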
So the mapping generator, as I said before, goes over the entire primary setup and creates a mapping template automatically, and all that is left to do is add the missing properties. This is a snippet of the generated mapping, and I'll show you a bit more in a demo in a few minutes. As you can see here, this was created automatically, and what is in color is what the user should add.

How can the user generate the mapping automatically? After he installs the oVirt Ansible disaster recovery role, he needs to define a play, which is pretty easy. He just needs to configure the site, the username and password, the var file, which is basically the output file the mapping should be written to, and under roles specify the role name. Then what it takes is a simple command, nothing too flashy: just ansible-playbook with the play which we defined before, and a tag to indicate that we want to generate the mapping.

So after that we have the mapping file. Once you have the secondary setup running, the replicated storage domains, and the mapping file already configured and generated, you are ready for a disaster. So what happens when a disaster actually occurs?
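A minimal sketch of such a play, assuming the role is installed from Ansible Galaxy under the name `oVirt.disaster-recovery` and that the tag names match the role's documentation (`generate_mapping`, `fail_over`, `fail_back`); the hostnames, credentials, and file paths here are placeholders:

```yaml
# dr_play.yml - illustrative play for the disaster recovery role.
- hosts: localhost
  connection: local
  vars:
    site: https://primary-engine.example.com/ovirt-engine/api  # primary engine API
    username: admin@internal
    password: "{{ vault_engine_password }}"   # keep secrets in a vault file
    ca: /etc/pki/ovirt-engine/ca.pem          # engine CA certificate
    var_file: disaster_recovery_vars.yml      # output file for the generated mapping
  roles:
    - oVirt.disaster-recovery
```

The same play is then driven by tags: `ansible-playbook dr_play.yml --tags generate_mapping` to create the mapping template, and later `--tags fail_over` or `--tags fail_back` with the filled-in var file. Again, the exact variable and tag names should be checked against the role's README.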
So for that we have the Ansible failover. The failover basically recovers the secondary setup automatically, using the mapping file as we've seen before. What it actually does under the hood is import the master storage domain first, then import all the other storage domains, all done automatically, then register the templates and register the VMs. It first runs the highly available VMs, because we can't be sure how many compute resources we have on the secondary site, it's only a backup setup, and then it runs all the other VMs. To do that, you just need to define a play as we did before, really nothing complicated: we indicate the target host, which is the secondary, and the source map which we need to use for the registration of the VMs and the templates, which is the primary. We add here in the var file the generated mapping file which we created, and the role as I said before, then the command, and it works.

I would like to show you a demo. You can see here at the top the primary setup; we have a running VM there. At the bottom is the secondary setup; no storage domains are attached there. So here you can see the mapping. This is, for example, the role mapping, and the storage domains which are part of the primary site and the secondary site, and the network mapping. Here, for example, I wanted to show how we use role mapping: we added UserRole as the primary name, and once we register a VM with this UserRole permission, it should be mapped to ClusterAdmin, just to show that this is done automatically. So we run a failover. I already predefined this play, and Ansible runs automatically. We first attach the storage domains, so as you can see here at the bottom, the storage domains get attached, and eventually we register the VMs and the templates. You can see here that we refresh, and now we see one VM, now we see the second VM, and the VM is running, because we knew this VM should be running. And just to see here, I'm not sure if you can see that, but here you see the
permissions of the VM in the primary site and the VM in the secondary site, and the role was basically switched from UserRole to ClusterAdmin, as configured before. So this was just a demonstration of how automated it is; the admin doesn't really need to do anything.

As I said before, once the recovery took place and the secondary site is running, with all the VMs running and everything, the admin keeps working on that secondary site and keeps making changes, which are applied on the secondary site, and eventually he will want to go back to the primary site, since it probably has stronger resources. And we need to make sure that all the changes that were done on the secondary site will also be applied on the primary site. For that we added a failback operation, to be used with the oVirt Ansible disaster recovery role.

The process here is that it first gets the primary site ready for import: basically, it cleans all the inactive storage domains so they can be imported again. Then it cleans the secondary site; this is done by moving all the storage domains to maintenance, detaching them, and removing them from the secondary site. Then it notifies the admin that the secondary site is cleaned and that the replication now needs to be switched: the primary site storage domains need to receive the changes that were done on the secondary site. Once the admin says OK, the primary site storage domains are updated with all the changes, and the same operation as the failover is done again. That means the primary site storage domains are attached again to the data center, we register all the VMs and the templates with all their updated data from the secondary site, and we run all the VMs which ran before, first the highly available ones and then all the others.

That being said, we now have an automation solution for DR with oVirt. We still have some gaps to fix and things to look forward to. First of all, we want to manage the mapping var
files from the oVirt UI; it will be much easier, as today they are edited in a plain file editor. We want to fill in all the remaining gaps in the OVF, whatever is needed, and we also want to add a bit more functionality to Ansible, like adding missing roles or users in the secondary site. That is also a kind of future plan we are looking forward to.

So, to summarize: we saw that the OVF now contains more information about VMs and templates, which gives us a recovery which is more reliable. VM statuses are now part of the recovery process, so we can know which VMs were running before, and the admin is in a better position to decide which VMs he wants to run again. The recovery process is automated using Ansible, which makes it much more reliable and easy for the admin to manage, and we added the failover and failback of an oVirt environment, to basically give some kind of defined process for a recovery solution. Those are the links, a few of them: the Ansible Galaxy page, the oVirt Ansible repo, and the oVirt Ansible disaster recovery repo, which you can see on GitHub. Yeah, that's about it. Are there any questions or comments?

Yeah. OK, so if I understand correctly, the question is what the time between each update is, each backup of your data to the OVF store disk. This is a configurable amount of time, which by default is 60 minutes. You can change it; basically it's configurable. You also asked what happens if, for example, you back up, then you make changes and the disaster occurs. Yeah, so in that case it basically won't be recovered up to the last minute or last second; you will have some kind of gap of probably a few minutes, depending on how you configure it. It's understandable, and it can basically be configured to fewer minutes, but that will reflect on oVirt and how it works, because each time it will need to upload the data to the storage domain. So it's a trade-off. Thank you for the
question. Yeah, so you are saying that a VM which is running will not be recovered? Yeah, that's only for the metadata; the VM will still lose its runtime state. It's not like live migration, for example. That's what I mean. That's right, thank you. Any other questions? Yes.

OK, so the question was related to the process that is done on recovery: we first run the highly available VMs and then run the other VMs, and what you mentioned is that there can be some kind of problem if there are not enough resources to run the second VM while the first VM is running, so it could be problematic for the whole process of the setup. Am I right? Yeah. So for that, it's basically the admin's responsibility to have those kinds of resources preconfigured on the secondary setup. I know that for a backup setup it's really a shame to have hosts that are running and doing nothing, just waiting for a disaster to occur, but if you are running highly available VMs, really precious VMs that you must run, then I suggest you really guarantee that those VMs are capable of running there. Yeah, I mean, we don't really guarantee it; the admins need to guarantee it themselves. Yeah, of course there is the highly available mechanism, and yeah, that's what we do. Any other questions? Yes.

I'm not sure that I heard the second part. Yeah, so for the hosted engine, we are basically working on it. It should be supported, but the storage domain of the hosted engine will be dedicated only to the hosted engine. So to support hosted engine, you can configure a hosted engine on the secondary setup, and the whole process will be done basically agnostic to the hosted engine. It doesn't matter, because the engine is already running; you just import the other storage domains and register the VMs and the templates. So that's the use case which basically should be supported. Any other questions? So if
you think of anything, feel free to come by. Thank you very much.