Hi, my name is Amitabh, and today we are going to talk about enabling DRaaS on OpenStack. Just a little background about Persistent: Persistent is an outsourced product development company. We have been engaged with all kinds of applications for ISVs and enterprises, and we have been engaged with cloud for a few years now. We have roughly 600 people working on different kinds of cloud technology and about 100 people working on OpenStack. Hari is here to talk about his experiences with enabling DRaaS on OpenStack. Hari.

Thank you, Amitabh. Hope you are all doing well. Disaster recovery as a service on OpenStack is what we set out to do: we tried to enable disaster recovery as a service on OpenStack. I will start from what DR is and why we need it in the first place, on a cloud, on-premises, or even on your laptop. Disaster can happen anywhere. As you know, on any cloud your compute nodes and machines can go off at any time: there could be a node failure, a power failure, natural disasters, or software errors. You never know when your instance or VM will go down, so you must have a DR plan, or at least a backup plan, ready when you run business-critical services on top of OpenStack. This talk is about how to achieve a DR service on OpenStack for workloads that live on OpenStack as well as on-premises or on a simple private cloud.

So why is DRaaS needed on OpenStack? First and foremost, to bring new customers to the cloud; it works for both private and public clouds. If you host your own public or private cloud and you allow subscribers to sign up for DRaaS, it will bring you new customers. Second, co-located DR infrastructure: suppose you have an application-specific private cloud where the data is very critical, you run customer services on it, and you want a backup plan in case a node goes down. You could put a DRaaS service on an OpenStack cluster, possibly a scaled-down version of your original private cloud, and subscribe those VMs to the service. Third, as we all know, business continuity, and for that the DR environment should be homogeneous: if your primary cloud is on OpenStack and the DR destination is also on OpenStack, it is more reliable. You get the same environment on both sides, you don't need to adjust, reconfigure, and retry, and you have less doubt that it could fail. Fourth, RTO and RPO: different applications have different RTO and RPO needs. One needs a replacement instance immediately; another can say that even if it fails for an hour, it is fine as long as it is back within the hour. Those are the four main areas a DRaaS service on OpenStack would address.

Let me take you through the agenda for this presentation.
First, we will evaluate the sources of DR: we identify the candidates for which you could do DR to an OpenStack-based private or public environment. Second, what methodologies or techniques could be used so that an OpenStack environment can act as a DR service, and what complexities and troubles come with them. Then we go through a small evaluation of what today's OpenStack needs for DR to be enabled on it. And then we look at the next steps: how we go ahead and put a full-fledged DRaaS service on OpenStack.

If you look at the sources of DR, DR could be enabled for anything. It could be physical machines, or virtual machines running traditional KVM or Xen hypervisors, where you simply have your machines and want a DR option to an OpenStack-based private or public cloud. Then your private cloud itself could have DR to a public cloud. That is the case where your applications or workloads run in a very small private cloud or an application-specific cloud; you are mostly concerned about the data, so you move those VMs to DR, keep your backups there, and if your private cloud fails, you move to the public cloud.

The basic steps are the same as for any DR, on OpenStack or anywhere else. First, on the source side, you snapshot: you either take a snapshot or capture the data packets, basically the delta changes, for the destination; whatever happens at the source must in some way be repeated or replicated at the destination. Second, you capture these changes and transport them across a transmission medium. The trivial medium these days is a wire or a fiber cable, some process by which you can send this data from source to destination, and you store it safely at the destination. Third, you have to detect that the source has failed and the replacement must now start up; that is a step where something, for example a Nagios script, continuously monitors the source and triggers: I have failed, please bring me up on the other side. Finally, you provision from that point. Provisioning has its own challenges, such as the recovery time objective and recovery point objective, but those belong to that particular step.
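To make the capture step concrete, here is a minimal sketch of a source-side delta loop, under assumptions of my own: the block size, the disk path, the interval, and the ship_delta() transport are all illustrative, not the actual DRaaS agent.

```python
# Minimal sketch of the source-side capture loop described above.
# Block size, disk path, interval, and ship_delta() are illustrative
# assumptions, not the actual DRaaS agent.
import hashlib
import time

BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB delta granularity

def capture_delta(disk_path, previous_hashes):
    """Return (offset, data) pairs for blocks changed since the last
    cycle, plus the new hash list. On the first cycle previous_hashes
    is empty, so the whole disk ships: the initial full snapshot."""
    delta, current = [], []
    with open(disk_path, "rb") as disk:
        index = 0
        while True:
            block = disk.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            current.append(digest)
            if index >= len(previous_hashes) or previous_hashes[index] != digest:
                delta.append((index * BLOCK_SIZE, block))
            index += 1
    return delta, current

def ship_delta(delta):
    """Hypothetical transport to the DRaaS service's landing point."""
    print("shipping %d changed blocks" % len(delta))

# The interval is configurable: as low as 15 seconds or as much as a day.
hashes = []
while True:
    delta, hashes = capture_delta("/var/lib/vm/disk.img", hashes)
    if delta:
        ship_delta(delta)
    time.sleep(15)
```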
Let me quickly reintroduce the terms, which many of us are familiar with, for everybody's convenience. RPO, the recovery point objective, says: if a disaster happens at this moment, from what most recent point can I restore my system, in other words, how many unsaved changes am I going to lose. RTO, the recovery time objective, says: if I have a disaster now and request a replacement, in what time will the replacement service be available. Delta backups are the changes in the data between your last recovery or backup point and the current backup point.

Those are the basics, and here is what we tried and have seen work. We took a standard hypervisor running KVM and spawned some virtual machines on it. We have a DRaaS service in a virtual appliance, and one more appliance called the data VM. The DRaaS service is the landing point: if you have multiple agents or multiple sources, they all come to your DR cloud, the cloud hosting the DRaaS, and the data packets are sent to it. The DRaaS agent on the source side has a configurable interval for taking the delta changes; it could be as short as 15 seconds or as long as one day. It captures the delta differences and sends them as packets to the DRaaS service. So the DRaaS agent requests the data backup, and the DRaaS service then works out which exact endpoint is trying to do a DR or a backup and assigns a data VM, which coordinates with the underlying storage to save the snapshot.

In this alternative we propose using the bootable volumes feature of OpenStack to save the DR disks, chiefly because anything else makes your recovery time objective larger: during recovery you would need to copy the data first to Glance, and then copy it again to the specific compute node during VM boot-up. That is two copies of the same data, and if you are copying a machine of one TB or so, it takes around half an hour. So if you care about a short recovery time, go for the bootable volumes alternative.

After the data is backed up, somebody has to trigger the recovery; that trigger comes from a user or a script. When it fires, the DRaaS service checks which machine has to be spun up: it mounts the bootable volume and issues a Nova boot command to the public or private cloud controller. While issuing it, it applies the policies that were predefined or given at the dashboard: a reduced flavor or the full flavor, defined network settings, or settings predefined using a config drive. The replacement instance comes up, can do a self-check, and is available.
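To illustrate why this keeps RTO low, here is a minimal sketch of the recovery step, assuming the openstacksdk client; the cloud name, volume, flavor, and network IDs are placeholders the real service would take from the predefined policies.

```python
# Minimal sketch of the recovery path in alternative 1, assuming the
# openstacksdk library. IDs below are placeholders; the DRaaS service
# would look them up from the recovery policies or the dashboard.
import openstack

conn = openstack.connect(cloud="dr-cloud")  # assumed clouds.yaml entry

def boot_replacement(volume_id, flavor_id, network_id, name):
    """Boot a replacement instance straight from the replicated
    bootable volume, so no image copy through Glance is needed."""
    return conn.compute.create_server(
        name=name,
        flavor_id=flavor_id,
        networks=[{"uuid": network_id}],
        # boot index 0 from the synced Cinder volume
        block_device_mapping=[{
            "boot_index": 0,
            "uuid": volume_id,
            "source_type": "volume",
            "destination_type": "volume",
            "delete_on_termination": False,
        }],
    )

server = boot_replacement("vol-1234", "flavor-5678", "net-9abc",
                          "replacement-vm")
conn.compute.wait_for_server(server)  # available once ACTIVE
```

Because the server boots with the replicated volume as its root disk, both the copy to Glance and the copy to the compute node are avoided.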
We have another alternative for people who are not very conscious about RTO: if somebody is okay with a bigger RTO and doesn't want to use a lot of block storage, there is alternative two. The functionality is almost the same, but instead of having two kinds of VMs, a DRaaS service and data VMs auto-scaling out together, we leverage the Swift service. Here the agent periodically takes a snapshot, computes the delta differences, and pushes them to Swift; after pushing, it updates a metadata file telling the DRaaS service that it has such-and-such updates, and it keeps going like that. You can configure after how many DR snapshots the backups are merged; the snapshots are merged periodically on the destination side, but usually you just keep this running.

In this alternative, when there is a DR request, the DRaaS service detects which VM needs to be spawned. It first merges all the snapshots into a usable image, uploads the merged image to Glance, and then asks compute to spin up the replacement VM, which comes up and is available for service.

This approach has a specific advantage: you don't need support from any provider. It works on any vanilla OpenStack, it needs no changes, and it doesn't need a lot of resources for the DRaaS service itself to run; you just need Swift and compute. Another advantage: if you are in a scenario where very little bandwidth is available to the cloud provider, and you are conscious of the bandwidth spent saving the data, you can use Swift's CDN services and back up to a very nearby location. That is useful when the machine to be backed up sits at a very remote location, like instrumentation or data machines along railway lines or at telecom points: if you really want to save data from there over a very small data pipe, you can use Swift over a CDN service and still be able to recover.

So there are two alternatives. In this second one the RTO will be larger than in alternative one, and for RPO it is more suitable where you back up once a day or so. Alternative one suits frequent backups; alternative two, presented here, suits cases where once a day is fine, and it is a cheaper solution than the first.
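Here is a minimal sketch of the agent-side push in this alternative, again assuming openstacksdk; the container name, object naming, and manifest format are assumptions for illustration, not the actual protocol.

```python
# Minimal sketch of alternative 2's agent push: each delta snapshot is
# uploaded to Swift, and a small manifest tells the DRaaS service what
# exists. Container layout and manifest format are assumptions.
import json
import time
import openstack

conn = openstack.connect(cloud="dr-cloud")
CONTAINER = "dr-vm-0042"  # hypothetical per-VM container
conn.object_store.create_container(name=CONTAINER)

def push_delta(delta_bytes, sequence):
    """Upload one delta snapshot, then refresh the manifest the DRaaS
    service reads to learn which deltas are available for merging."""
    name = "delta-%08d" % sequence
    conn.object_store.upload_object(
        container=CONTAINER, name=name, data=delta_bytes)
    manifest = {"latest": sequence, "updated": time.time()}
    conn.object_store.upload_object(
        container=CONTAINER, name="manifest.json",
        data=json.dumps(manifest).encode())
```

On recovery, the DRaaS service would walk the manifest, merge the base plus deltas into one image, upload it to Glance, and boot the replacement VM from it, which is where the extra RTO of this alternative comes from.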
Next, there are some challenges we want to call out from implementing and trying this. One is the huge amount of data involved in moving from source to destination: if you run disk-intensive programs or save big video files, your delta files are big, and that means you are going to choke the data pipe between the source and the destination cloud. That one still remains a challenge: we didn't find a way to slice the delta backups, so either you send the entire large packet in that cycle or you don't send it at all.

On RTO and RPO, we sometimes see requirements like booting in less than a minute. Although we have all the pieces, applying the delta backups at the destination takes a little more than a minute, so a one-minute or two-minute RTO and RPO story is particularly challenging when a lot of data is generated at the source.

Next, some people are not ready to change their OpenStack installations at all: they don't want any agent sitting in their hypervisor or anywhere else. That is one challenge, and we still wanted to make it work. People are also concerned about installing agents at sources such as laptops or physical servers, and about adding more capacity to Cinder, since we put things on Cinder.

Application quiescing is another. When we wanted to take this DR to applications where the DR should be done on a group, a cluster of machines, it was always a challenge, because we use a single data pipe and there is no way all three machines are synchronized. There needs to be some intelligence on the agent side too, where you group the clustered VMs into the same consistency group and then bring them up on the other side in a synchronized manner; a minimal sketch of that grouping appears just below, before the questions. And some small, trivial OpenStack-style implementation issues still exist: set this flag and that flag, add these security rules, edit those configuration files, and so on.

Those are the challenges; now the next steps. We want to propose an incubated DR project for OpenStack. On disk replication: today we use disk replication agents on the source and destination, but a disk replication service, one you could ask to replicate a VM to a destination, would be more useful when doing DR from OpenStack to OpenStack or vice versa. Then there is the agent story. Today, if we ask to put an agent on a guest VM, people are very concerned about its security, because they haven't evaluated it and don't know what is inside. So making the DRaaS agents public, and making them part of OpenStack releases, also needs to be considered, so that the agent is tested against your security standards: if you are okay running OpenStack code, you are okay running the OpenStack agents as well.

That's all from my side. Do you have any questions, please?
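This is the grouping sketch referenced in the quiescing challenge above; the class, the registry, and the boot_replacement() hook are all hypothetical, just one way the agent-side intelligence could look.

```python
# Minimal sketch of the consistency-group idea: VMs in one application
# cluster share a group tag, and a failover request brings the whole
# group up together rather than machine by machine. All names here are
# illustrative.
from collections import defaultdict

class ConsistencyGroups:
    def __init__(self):
        self._groups = defaultdict(list)

    def register(self, group, vm_id):
        """Tag a VM as a member of an application consistency group."""
        self._groups[group].append(vm_id)

    def failover(self, group, boot_replacement):
        """Fail over every member of the group in one pass, so the
        cluster comes up together in a synchronized manner."""
        return [boot_replacement(vm_id) for vm_id in self._groups[group]]

groups = ConsistencyGroups()
groups.register("billing-cluster", "vm-db")
groups.register("billing-cluster", "vm-app")
groups.register("billing-cluster", "vm-web")
# groups.failover("billing-cluster", boot_replacement)  # hypothetical hook
```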
You talked a lot about the infrastructure piece: moving the VM, getting it to the DR side, and so on. But there are other pieces involved, such as network configuration. Taking a snapshot of a VM and moving it somewhere is pretty trivial; there are ways of doing that. The actual work is bringing the VM online and making sure it is attached to the right network in the same VLAN, that it sees the same AD server by pointing to the right DNS, and, going a bit deeper, that the application configuration points to a certain hostname convention, so the domain name and subdomain change, and so on. How are you planning to deal with that?

That's a great point. In that scenario we want to go the OpenStack way of describing the workload. We didn't talk extensively about that, because this presentation was mainly about migrating from on-premises or a laptop to the destination; there was a piece called the dashboard, and the dashboard or the API could hold that information about the network and so on. But when you go from OpenStack to OpenStack, you can always use a config drive, which is the way we used, or you could build around the APIs to support a workload description.

It's not only the network configuration we are concerned about; sometimes there are attached volumes as well. Yes, and there are also attached services. So the right way to go about it is to have a proper workload description, and when you take the workload to the other cloud, you re-instantiate that same description. It's not only the disk changes; this talk concentrated on the disk changes and how you move a VM, but you also need to keep track of the changes in the VM's metadata, the owner, and all of that. Yes, I agree with you on that. Thank you.

In your option one, you propose to use block storage volumes, and did I hear that you propose the agent should be doing delta storage into block as well, so you have VM deltas? Yes. And is there also the idea of re-accumulating those deltas into a single boot volume periodically? You have the option to, yes, but what we should do is keep the bootable volumes exactly in sync: if a delta packet is sent from the source, you extract and apply it then and there. You don't accumulate the deltas and apply them only when the DR happens; if you get a delta, you apply it. Have you done a proof of concept of that? Yes, and it works. So you are applying these deltas directly to the bootable volumes, in real time? Yes, in real time: we get a delta and we apply it, so we ensure that the source disk is exactly replicated by the Cinder disk. At any given moment it is consistent; Cinder is exactly replicating the source, the two are in sync. When you boot up, you just boot from the volume, no copy is involved, and you get a very low RTO. Gotcha, all right.
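To make the real-time apply concrete, here is a minimal sketch of writing a delta packet straight onto the attached bootable volume; the device path is an assumption, and a real data VM would discover it from the volume attachment.

```python
# Minimal sketch of applying a delta packet directly to the attached
# Cinder bootable volume, as discussed above: each (offset, data)
# record is written in place so the volume stays in sync with the
# source. The device path is illustrative.
import os

def apply_delta(device_path, delta):
    """Write each changed block at its original offset, then flush so
    the volume is consistent before the next delta arrives."""
    fd = os.open(device_path, os.O_WRONLY)
    try:
        for offset, data in delta:
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

# e.g. apply_delta("/dev/vdb", delta)  # /dev/vdb: the attached volume
```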
And the agent itself: that is something you wrote in your proof of concept, is that correct? In our proof of concept we used an agent which is proprietary today; yes, we wrote it. Right, so moving this into an OpenStack project, how would you propose to deal with that agent capability? Is it a new service? Is it a component of each project? How would that get delivered?

The whole concept of this talk is to keep the changes in each project very small; we don't intend to change any project. We just want to propose one new incubated project, say that it is going to hold the services, and keep the changes in any existing project to the minimum level possible, such that we just use whatever OpenStack has today. If that means some small changes are needed, we want to propose them through blueprints and get them done. On the agent side, the agents should come into an ecosystem, on the public side, with an agent for physical Windows machines, an agent for KVM-based hypervisors, for Xen-based hypervisors, for VMware, and so on; all these agents store their data in different ways, so that intelligence has to come from there, and a sketch of such a per-platform agent interface follows after the next exchange. Why an agent per kind of device? Because tomorrow, down the line, we should be able to back up even our mobiles: you install an agent on your mobile, send over the entire data, and if you lose the mobile some day, you at least have the snapshot running on a simulator. That is the whole thought process behind an agent per kind of device. We are still trying to figure out how to contribute what we have, and whether we need a separate blueprint or a different mechanism to distribute what we are writing on the agent side; we are still working out the details, but we'd like to see it go out. Okay, thank you. Thanks.

Hello, interesting talk. I'm interested in the testing you did: what quantity of VMs have you failed over, and which OpenStack services do you perceive will be under stress during a DR failover scenario?

We tried with five or ten VMs in our lab, with a couple of VMs getting backed up regularly. On your second question: it is definitely the input data pipe and the endpoints where your VM runs. I don't think there will be stress on any OpenStack service as such, neither Neutron nor Nova nor Cinder, but there will definitely be stress on the network backplane over which you take the backups: if you back up several hundreds of VMs, and the delta changes are big, particularly where there are huge writes on the disks, your data pipeline is going to be loaded or stressed, and if you haven't planned capacity to that level your network may go slow. As such it is only a couple of Nova requests, so you don't see a big stress on the running OpenStack services, but you will definitely see stress on the disks, the disk writes, the disk I/O, and the regular network traffic of OpenStack. Did I answer your question?
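Here is the per-platform agent interface sketched as promised in the answer above; every class and method name is illustrative, not a proposed OpenStack API.

```python
# Minimal sketch of the per-platform agent idea: one abstract contract,
# with a concrete agent per source type (KVM, Xen, VMware, physical
# Windows, ...), since each stores its data differently. All names are
# illustrative.
from abc import ABC, abstractmethod

class DrAgent(ABC):
    """Common contract every platform-specific DRaaS agent fulfils."""

    @abstractmethod
    def snapshot(self):
        """Take a consistent snapshot of the source."""

    @abstractmethod
    def delta_since(self, snapshot_id):
        """Return the block changes since the given snapshot."""

class KvmAgent(DrAgent):
    def snapshot(self):
        ...  # e.g. snapshot the backing file on the KVM host

    def delta_since(self, snapshot_id):
        ...  # block-hash comparison as in the earlier capture sketch

class WindowsPhysicalAgent(DrAgent):
    def snapshot(self):
        ...  # e.g. a filesystem-level snapshot on the physical host

    def delta_since(self, snapshot_id):
        ...  # block-level diff of the raw disk
```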
I have a question about how you send deltas from the primary site to the DR site. Do you have any thoughts on how to move the workload back to the primary site after the cause of the disaster is removed? Do you keep deltas at the disaster site as well and just push the deltas back to the primary site, or do you have to copy back all the data?

We haven't done that work yet, but that's a very good question; there are some challenges involved. If your source is a virtual machine on a KVM kind of hypervisor, we have no problem moving it back: you just move it back, rename it, and turn it on as a different virtual machine, and you have all the changes back. But you will definitely have a challenge if you had failed over a physical machine: there the blocks are different from the blocks here, so it depends entirely on a very intelligent agent, on whether you were able to save that metadata as well, and on being able to reapply it on the source when it comes back. So fail-back is still debatable, or at least, I should say, I haven't understood it fully yet, so I don't have a way to answer that.

Just a quick background: Persistent also has a small disaster-recovery-as-a-service offering for SMBs which runs out of a different division. We have learned and leveraged a lot from there, and we are incorporating some of the complexities we learned into this experiment, to see how we can take what we have and move it onto OpenStack.

Hi. I'm somewhat concerned about policies. What kind of retention policy might you have, especially when VMs come and go? Effectively, once they no longer have any use in life and you terminate them, how do you clean up the backups? Otherwise your Swift store is going to grow continually and eventually run out of space. Likewise, how do you determine, again via policies, which VMs are loved versus the ones you should ignore and not do any DR on?

As a DRaaS we don't define that particular policy; it is up to the configurator or the user to choose which VMs get DR and which don't. On your first question, how long we keep the data: as long as the DR is required we keep it, and periodically we can merge the snapshots so that they don't occupy many separate objects; that should be programmable and configurable. Did I answer your question?

One more way to deal with this, again from the experience of our cloud business, which we offer as a service to customers: in that particular product we basically sell storage, not VMs. We provide a mechanism by which we can say, you are entitled to, say, 100 terabytes of storage, and at any given time you can spin up n VMs; that's what we allow. And we define a set of policies on which types of VMs and snapshots can be deleted from the system. It's completely customer driven at that point.
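As a sketch of the merge-and-retain idea from the retention answer above, assuming openstacksdk and the Swift layout from the earlier alternative-two sketch; the retention count and the merge_into_base() step are hypothetical.

```python
# Minimal sketch of a programmable retention policy: keep the last N
# delta objects per VM in Swift and fold anything older into the base,
# so the store doesn't grow without bound. Names and the merge strategy
# are assumptions.
import openstack

conn = openstack.connect(cloud="dr-cloud")
KEEP_DELTAS = 24  # hypothetical policy: retain the last 24 deltas

def prune(container):
    """Fold old deltas into the base object, then delete them."""
    names = sorted(o.name for o in conn.object_store.objects(container)
                   if o.name.startswith("delta-"))
    for name in names[:-KEEP_DELTAS]:
        merge_into_base(container, name)  # hypothetical merge step
        conn.object_store.delete_object(name, container=container)
```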
You briefly mentioned some sort of generic replication engine in OpenStack. Could you elaborate a bit on that, and add any thoughts about what sort of APIs you might have for a generic interface to different replication engines?

Well, that is something we are still thinking through, but what we would need is this. We all agree that some kind of replication must happen from the source to the destination, and in the OpenStack-to-OpenStack scenario the replication becomes more prominent, more visible as part of the functionality. One: if I am an OpenStack endpoint, I should have the capability to push my changes to the destination cloud. Today's OpenStack doesn't have that capability, so we need to add it: if you are acting as a source, you should be able to write to any kind of destination; when a change happens, in this attachment or in metadata, you do a push of it. Two: on the destination side, you should be able to take different replicator connections. OpenStack should support replicator plugins, so that if you have a plugin and a replication endpoint, you can bring in third-party or commercial solutions; then you don't need to worry about the replication piece or carry a data channel through your own network, and you can leverage the very efficient commercial options there. So when you have this DRaaS as an OpenStack project, you can always have a replication plugin: you define a DRaaS, you have replication hooks, and in the hooks you choose which replication engine to use, whatever public, open-source, or licensed products are out there. That is how we are thinking of putting it across.

Any other questions? Thank you very much for attending. Thank you very much. Thank you.
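Finally, a minimal sketch of the replicator-plugin hooks described in that last answer; no such OpenStack API exists today, and all names here are illustrative.

```python
# Minimal sketch of replicator plugins for a future DRaaS project: a
# small interface exposed as hooks, with third-party or commercial
# engines registering implementations by name. Purely illustrative.
from abc import ABC, abstractmethod

class ReplicatorPlugin(ABC):
    """Contract for pluggable replication engines."""

    @abstractmethod
    def push(self, resource_id, delta):
        """Push a change from the source endpoint to the destination."""

    @abstractmethod
    def accept(self, connection):
        """Accept an incoming replicator connection at the destination."""

_REGISTRY = {}

def register(name, plugin_cls):
    """Let third-party or commercial engines plug in by name."""
    _REGISTRY[name] = plugin_cls

def get_replicator(name, **config):
    """Instantiate the engine chosen in the DRaaS replication hooks."""
    return _REGISTRY[name](**config)
```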