Okay. So, good afternoon everyone. Apparently I'm the last talk between now and the evening events. I'm going to be talking about a survey of disaster recovery approaches for OpenStack deployments.

The outline and objective of this talk: first, to make sure we have some common terminology on what exactly we mean by DR, disaster recovery. I've seen "disaster recovery" for OpenStack used for things like how to deal with a device failure or how to deal with a software failure, and I want to talk about a somewhat different aspect of disaster recovery. I'll talk a little bit about key concepts: recovery time objective, recovery point objective, the importance of consistency and the different types of consistency that come into play in DR solutions, and the support one may have from the underlying storage. Then I want to go over three different approaches to providing disaster recovery for workloads running on OpenStack infrastructure: a generic approach, which uses only OpenStack mechanisms and makes no assumptions about the application or the underlying storage infrastructure; an application-specific approach, which makes certain assumptions about the application and requires the cooperation of the application; and what I've called here an advanced approach, one that takes advantage of certain assumptions about the storage infrastructure. And then finally, I want to compare these approaches and raise for consideration some things we may want to see in OpenStack in the future to better support disaster recovery.

So let's start with: what is disaster recovery? I took the definition that was in the abstract; it comes from Wikipedia. Disaster recovery is the process, policies and procedures for recovery of technology infrastructure after a natural or human-induced disaster. Let's look at this definition in a little more detail. Natural or human-induced disaster: what types of disasters are we talking about? We're talking about things like floods, hurricanes, tornadoes, earthquakes, potentially even volcanoes, poisoning events, fires, terrorism; things that can take out an entire data center. So we're not worried about a device failing, a server failing, or a localized software failure. We're worried about a data center being unusable, either permanently or for a significant period of time, and the need to move the work over to another data center in order to continue operations. Surviving a disaster therefore requires geographic dispersion. We can't survive the types of disasters we're talking about here within a single data center, because the assumption is that a disaster takes out a data center.

Technology infrastructure: business continuity is concerned with the survival of the overall business; disaster recovery looks only at the technology infrastructure, the IT component of the business. But it looks at all of the IT component: the servers, the storage, the network, the software, the configuration, everything related to the IT. In the interest of time and to enable a focus, in this presentation I'm going to focus mostly on the storage aspect. The storage aspect is critical because that's the persistent state; that's what changes as the application modifies its state, and it's what needs to be dispersed and replicated if you want to be able to survive a disaster.

Finally, the processes, policies and procedures for recovery. These have three different elements.
There's what you do before the disaster, what you do when the disaster happens, and what you do after the disaster. If we look in more detail at what you do up front: this is the good part, when everything is working fine, and it too has multiple elements. There's planning: figuring out what data needs to be copied, what data needs to be geographically dispersed, and how you copy that data. And there's testing: making sure your solution actually works.

There are various ways you can copy the data. You can take a continuous approach, where data is being copied by the system all the time, or a periodic approach, where once a day, once a week, once every six hours, data is copied. A continuous approach can be synchronous, in which all updates to your primary data are replicated to the remote site, to your secondary data center, before the write is completed back to the host, back to the guest, the application making the write. Or it can be asynchronous, in which a write is completed from the application's perspective, and then at some point in the future that data is replicated to the remote site. A periodic approach can be done online, transferring the data over a network, or offline. There are still many DR solutions out there where a daily backup goes onto a tape and that tape is shipped off site; that too is a form of DR solution.

Then there's detection. I have to be able to figure out that a disaster has occurred and respond to it, otherwise I'm not going to be able to go to the next step, which is recovery. Recovery involves recovering the infrastructure and recovering the application. I need to make sure that at my recovery site, my secondary data center, which is now becoming my production data center, I have all the necessary infrastructure up and operational. Depending on the solution I've deployed, I may have that infrastructure operational all the time, or I may need to buy, lease or bring in new servers or new storage and get the infrastructure up. And once I have that infrastructure, I need to get the applications up and running.

Two key concepts in particular when planning a DR solution are the recovery point objective and the recovery time objective. The recovery point objective, or RPO, is how far back in time a disaster will take me; in other words, how much data, how much work, I've lost. The closer that recovery point objective gets to zero, in general, the more expensive the solution is. If I'm doing a backup, putting it on tape, and shipping that tape off site once a week, my recovery point objective is one week. It's a relatively inexpensive solution, but I lose a lot of data when there's a disaster. If I want a recovery point objective of zero, meaning any update that's been written to the storage is guaranteed to survive a disaster, I need a synchronous solution in which every write is copied from the primary data center to the secondary data center as part of the actual process of the host writing that data. That's the only way to guarantee that no data gets lost.
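As a rough illustration of the synchronous versus asynchronous trade-off just described, here is a toy Python sketch; the classes and names are made up for illustration and model only the ordering of the acknowledgement relative to the copy to the secondary site.

```python
# Toy model of synchronous vs. asynchronous replication (illustrative only).
from collections import deque


class SyncReplicatedVolume:
    """Every write is hardened at the secondary before it is acknowledged (RPO = 0)."""

    def __init__(self):
        self.primary, self.secondary = [], []

    def write(self, data):
        self.primary.append(data)
        self.secondary.append(data)   # replicate first...
        return "ack"                  # ...then acknowledge the guest


class AsyncReplicatedVolume:
    """Writes are acknowledged immediately and shipped later (RPO > 0)."""

    def __init__(self):
        self.primary, self.secondary = [], []
        self.pending = deque()

    def write(self, data):
        self.primary.append(data)
        self.pending.append(data)     # will be copied "at some point in the future"
        return "ack"

    def replicate_some(self, n=1):
        for _ in range(min(n, len(self.pending))):
            self.secondary.append(self.pending.popleft())


sync_vol, async_vol = SyncReplicatedVolume(), AsyncReplicatedVolume()
for block in ["A", "B", "C"]:
    sync_vol.write(block)
    async_vol.write(block)
async_vol.replicate_some(1)           # only "A" made it across before the disaster

# Disaster strikes: only the secondary copies survive.
print("sync survivor :", sync_vol.secondary)    # ['A', 'B', 'C']  -> nothing acknowledged is lost
print("async survivor:", async_vol.secondary)   # ['A']            -> 'B' and 'C' are lost
```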
Now, as I said, the lower the recovery point objective gets, the more expensive the solution gets. Some workloads require an RPO of zero; some don't. The recovery time objective is how long it takes me to get operational after a disaster. If I have a hot data center with servers sitting there running, my applications already loaded into VMs, everything already there and ready to start working, I can have an RTO that approximates zero. In practice I'm not sure you can ever actually get to an RTO of zero, but you can get fairly close. But that can be really expensive, because basically I need everything sitting there running, ready to take over just in the event that I have a disaster, so I've essentially doubled the costs. On the other hand, if I don't have any remote site, just a bunch of tapes sitting somewhere, I'm going to have a very long RTO, because I'll have to find a data center, find servers, load the data from tapes and so on. And while I'm using tapes as a concept here, these can be virtual tapes, a set of disks that have been taken offline; they don't actually have to be physical tapes.

A third concept that's really important to consider is the consistency of the data: what is the state of my data at the recovery site when I need to go and recover? The easiest way to explain this is with a conceptual application. Here we have an application that performs a series of transactions, and in each transaction it writes a piece of data to volume one and then to volume two. So in the first transaction it writes A to volume one and then B to volume two; in the second transaction it writes C to volume two and D to volume one; and so on. Multiple states of the data are possible at the secondary site, because the data is being written to multiple volumes, which are independent entities.

We could have a situation at the secondary site where volume one holds A and D and volume two holds only B. This is inconsistent: there is no point in the logical flow of the application where this state could have occurred. It happens if, for instance, volume two was copied and then at some later point in time volume one was copied; in other words, I've made the copies of the volumes at logically different points in time.

The second type of consistency to consider is power-fail consistency. With power-fail consistency, the state of the data at my recovery site is one that could have existed had there been an instantaneous power failure at the primary site. Most serious applications know how to recover from a power outage, so this is a state of data that most applications know how to recover from. It may take a long time: a database may need to do a complete log replay, depending on the size of the log, and file systems may need to run file system checks. But applications should in general know how to recover from this state.

The final type of consistency to call out here is application consistency. Application consistency says that the state of the data at the secondary site is semantically meaningful to the application. Here, for instance, we see a complete transaction, and only complete transactions.
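The consistency distinctions can be illustrated with a toy Python sketch of the two-volume transaction example above; the helper names are made up, and "power-fail consistent" is modelled simply as "matches the state after some prefix of the write sequence".

```python
# Illustrative sketch of the two-volume transaction example (hypothetical, not from the slides).
# Each transaction writes one item to volume one and one to volume two.
writes = [("vol1", "A"), ("vol2", "B"),   # transaction 1
          ("vol2", "C"), ("vol1", "D")]   # transaction 2


def state_after(n):
    """The on-disk state after the first n writes (what an instantaneous power failure would leave)."""
    state = {"vol1": [], "vol2": []}
    for vol, data in writes[:n]:
        state[vol].append(data)
    return state


def is_power_fail_consistent(state):
    """A state is power-fail consistent if it matches the state after some prefix of the writes."""
    return any(state == state_after(n) for n in range(len(writes) + 1))


# Copying the volumes at logically different points in time:
# volume two copied after write 2, volume one copied after write 4.
mismatched_copy = {"vol1": state_after(4)["vol1"],   # ['A', 'D']
                   "vol2": state_after(2)["vol2"]}   # ['B']  -> D present but C missing

single_instant_copy = state_after(3)                 # {'vol1': ['A'], 'vol2': ['B', 'C']}

print(is_power_fail_consistent(mismatched_copy))     # False -> no application state matches it
print(is_power_fail_consistent(single_instant_copy)) # True  -> power-fail consistent
print(is_power_fail_consistent(state_after(4)))      # True, and also application consistent,
                                                     # since only complete transactions are present
```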
Now, to get application consistency you need some hook into the application. Microsoft has VSS, for instance, and various applications have hooks through which the code responsible for the replication can ask the application for a consistent view of the data and ensure that the data that gets replicated is from a consistent point in time. With application consistency you're not going to be able to get an RPO of zero, because not all points in time are application consistent.

Finally, let me talk a little bit about storage system support. I'm going to talk about this from the perspective of a storage controller, but some of this could in principle be done at the level of software solutions. Many storage systems, in fact most storage subsystems today, support synchronous replication. In synchronous replication it's the responsibility of the storage subsystem to ensure that data is replicated as part of writing that data: the data is hardened at the remote site before the acknowledgement of the host write goes back to the host, and this enables an RPO of zero. So here the host writes A; that data goes to the primary copy; the storage system copies the data to the secondary, gets an acknowledgement back from the secondary, and only then returns the acknowledgement to the host. This enables an RPO of zero because every write is replicated as part of the application doing the write, and before the application proceeds to the next write it knows that its prior write exists at both data centers.

In asynchronous replication, data is copied after the acknowledgement; it's not copied as part of the actual write. This guarantees that the recovery point objective is going to be greater than zero: with asynchronous replication, when there is a disaster there will be data loss. So the host writes, the primary storage gives an acknowledgement back to the host, and at some later point in time that data is replicated over to the secondary site, which acknowledges it back to the primary.

Another concept that most storage subsystems have is consistency groups: a set of volumes whose replicas the subsystem will ensure are maintained in a power-fail consistent state at the remote site. Consistency groups are essential when you're dealing with asynchronous replication; because of some corner cases they're also important for synchronous replication. It's a basic feature that exists in many storage subsystems, and as I said, these features are not unique to any one particular product; they're fairly common in storage subsystems today. (A small sketch of the consistency-group idea follows below, after the outline of the approaches.)

Okay, so with that as background, how can we provide DR for an OpenStack environment? As I said, I want to talk about three different approaches: a generic approach, which uses only OpenStack mechanisms, makes no assumptions about what the application can do, and assumes no support from the storage subsystems; an application-specific approach, which is tied to the behavior of specific guests; and an approach that leverages storage subsystem support. Each of these approaches can be evaluated on five different parameters: what recovery point objective it supports and what recovery time objective it supports, in other words how much data is lost and how long it takes to get back to being operational; what consistency model it supports; and then two parameters that are fairly self-describing: how complex the solution is to implement, and how generic it is.
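The consistency-group idea can be illustrated with a tiny, hypothetical sketch; nothing here corresponds to a real storage product's API. Writes to all member volumes are briefly held off so the whole set is captured at one logical instant, giving power-fail consistency across the group.

```python
# Hypothetical sketch of a consistency group: snapshot a set of volumes at one logical instant.
import copy
import threading


class Volume:
    def __init__(self, name):
        self.name, self.blocks, self.lock = name, [], threading.Lock()

    def write(self, data):
        with self.lock:
            self.blocks.append(data)


class ConsistencyGroup:
    """All member volumes are snapshotted while writes to every member are held off,
    so the resulting set of snapshots is power-fail consistent across the group."""

    def __init__(self, volumes):
        self.volumes = volumes

    def snapshot(self):
        for v in self.volumes:            # hold writes on every member (fixed order)...
            v.lock.acquire()
        try:
            return {v.name: copy.deepcopy(v.blocks) for v in self.volumes}
        finally:
            for v in self.volumes:        # ...then release them
                v.lock.release()


vol1, vol2 = Volume("vol1"), Volume("vol2")
group = ConsistencyGroup([vol1, vol2])
vol1.write("A"); vol2.write("B")
print(group.snapshot())   # {'vol1': ['A'], 'vol2': ['B']} -- a single point in time for both volumes
```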
Okay, so let's start with the generic approach. Here the assumption is that we have a VM, and attached to this VM are two volumes which have been provisioned via Cinder; the VM has been provisioned via Nova. We also have a Swift installation at a remote data center. So we have our production data center, the primary site, with the VM and the two volumes attached to it, and Swift at the secondary site.

One way we can achieve DR is by taking advantage of a Nova snapshot of that VM and copying the application's persistent data. We would issue a Nova snapshot of the VM; this then goes and invokes Cinder snapshots, first on volume one and then on volume two. I've now snapshotted, and essentially copied, the two volumes attached to that VM. But there's a problem. What is the problem? Those copies are inconsistent. Additional writes that are dependent upon one another could have gone to the volumes between the snapshot of volume one and the snapshot of volume two, which can give me an inconsistent state.

Once I have those snapshots I can create volumes from the snapshots within Cinder, and this can be done in any order; there's no consistency issue here because the snapshots are not being modified. Once I have those volumes I can back them up, via the mechanism that was added in Grizzly, to Swift. Again, this can be done in any order, because those volumes are not attached and not being modified. But this brings up another issue: it's going to produce a very high recovery point objective, because each time I want to create a copy of the data at my remote site, giving myself the ability to recover from some point in time, I'm copying the entire volumes, all of the persistent data. I'm potentially copying a huge amount of data, and if I'm copying a lot of data I can only do it so often, which can lead to a fairly high recovery point objective.

Okay, so after a disaster, at the secondary site I recover the volumes: I can restore from the backups in Swift into two volumes that can then be managed by Cinder. And this brings up the third issue with this solution: a high recovery time objective. I'm copying an entire volume of data out of Swift into volumes managed by Cinder, and if there's a large amount of data this is going to be time consuming. Once I have the data back in the volumes, I can restart a VM, reattach the volumes to it, and get back up and running at my recovery site.

Now, I've skipped a lot of steps here; there's a lot of work that needs to be done. As I said, I'm focusing on storage, but clearly we need to copy the images from Glance, we need to have the images at both sites, and we need to ensure that the network configuration, the Quantum configuration, is cloned and appropriately managed at the remote site. We're typically dealing not with a single VM but with a collection of VMs, so we need to make sure the Nova configuration is correctly managed, and we need to make sure the Cinder configuration is correctly managed. Clearly there's a lot of additional work here.
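For the volume-level part of this flow, a rough sketch using the python-novaclient and python-cinderclient libraries of that era might look like the following. The credentials, IDs and sizes are placeholders, error handling and all of the Glance, Quantum and Nova configuration work mentioned above are omitted, and in reality the restore and boot steps would run against the secondary site's endpoints; treat the exact calls as indicative rather than authoritative.

```python
# Rough sketch of the generic approach at the volume level (Grizzly-era v1 clients).
from novaclient.v1_1 import client as nova_client
from cinderclient.v1 import client as cinder_client

# Placeholder credentials; adjust for your cloud (illustrative only).
nova = nova_client.Client("admin", "secret", "demo", "http://keystone:5000/v2.0")
cinder = cinder_client.Client("admin", "secret", "demo", "http://keystone:5000/v2.0")

VM_ID = "11111111-2222-3333-4444-555555555555"      # placeholder Nova instance
VOLUME_IDS = ["<volume-one-id>", "<volume-two-id>"]  # placeholder Cinder volumes

# --- Before the disaster, at the primary site ---
# 1. Snapshot the VM image, and snapshot the attached volumes one after the other;
#    the gap between the two volume snapshots is exactly the consistency problem above.
nova.servers.create_image(VM_ID, "dr-image")
snapshots = [cinder.volume_snapshots.create(vid, force=True) for vid in VOLUME_IDS]

# 2. Create (unattached, unmodified) volumes from the snapshots, in any order...
frozen = [cinder.volumes.create(size=10, snapshot_id=s.id) for s in snapshots]

# 3. ...and back them up to Swift with the backup service added in Grizzly. These are
#    full copies, which is what drives the RPO up: you can only repeat this so often.
backups = [cinder.backups.create(v.id, container="dr-backups") for v in frozen]

# --- After the disaster, at the secondary site ---
# 4. Restore each backup into a new Cinder-managed volume (again a full copy: high RTO),
restores = [cinder.restores.restore(b.id) for b in backups]

# 5. then boot a replacement VM from the copied image and re-attach the restored volumes.
image = nova.images.find(name="dr-image")
flavor = nova.flavors.find(name="m1.small")
server = nova.servers.create("recovered-vm", image, flavor)
for i, r in enumerate(restores):
    nova.volumes.create_server_volume(server.id, r.volume_id, "/dev/vd" + chr(ord("b") + i))
```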
A second approach is the application-specific approach: disaster recovery using guest-provided mechanisms. In this case the assumption is that the application knows how to provide DR for itself. Databases, for instance, often know how to do this, for example by shipping a log: if I do a database update at one site, the database can be hooked up with a copy of itself running at another site, where any updates to the log at the primary site are shipped off to the database at the secondary site, and those log updates are then applied at the secondary site.

The way this would work in an OpenStack environment is that an admin would provision the VM or VMs, plus the storage, plus the network, at both the primary and the secondary site. In other words, the admin sets up, using the normal OpenStack mechanisms, the workload to run at the primary site and the workload to run at the secondary site. Then the admin configures the application to say "this is my primary copy, this is my secondary copy", and that's done completely outside the scope of OpenStack; OpenStack is completely unaware of it. The admin then starts the workload at the primary and the secondary. After it's started, the user interacts with the primary, and it's entirely at the application level that the data gets transferred from the primary to the secondary site; there's no involvement of OpenStack in this. When a disaster occurs, the admin, again without any involvement of OpenStack, configures the secondary as the new production workload, the application performs any additional cleanup it needs to start working as production, and the user starts interacting with the secondary.

So what do we see here? What are the benefits? Application-based recovery can give us a good recovery point objective, a good recovery time objective and good consistency. The drawback, and the key assumption here, is that it depends on the application. Only certain applications support application-level DR, for instance enterprise databases, so it's not a generic solution, and it works only within the confines of that application. I can't get cross-application consistency: if my workload consists of some homegrown application, or some other application that doesn't support application-level DR, plus a database that does, then even if my database supports it I'm not going to have consistency across those two parts of the workload.
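As a purely conceptual illustration of log shipping (this is a toy, not any real database's mechanism), the primary appends each update to a log and ships the record to a standby, which replays it and can be promoted when a disaster occurs:

```python
# Toy illustration of application-level log shipping (illustrative only; real databases
# such as PostgreSQL or DB2 implement this with their own WAL / log-shipping machinery).
class StandbyDB:
    def __init__(self):
        self.data = {}

    def apply(self, record):          # replay shipped log records at the secondary site
        key, value = record
        self.data[key] = value

    def promote(self):                # on disaster, the standby becomes the production copy
        return self.data


class PrimaryDB:
    def __init__(self, standby):
        self.data, self.log, self.standby = {}, [], standby

    def update(self, key, value):
        record = (key, value)
        self.log.append(record)       # write-ahead log entry at the primary site
        self.data[key] = value
        self.standby.apply(record)    # ship the log record to the secondary site


standby = StandbyDB()
primary = PrimaryDB(standby)
primary.update("balance:alice", 100)
primary.update("balance:bob", 250)

# Disaster at the primary site: the admin promotes the standby, entirely outside of OpenStack.
print(standby.promote())   # {'balance:alice': 100, 'balance:bob': 250}
```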
The final approach to call out is disaster recovery using storage functions, and there are several ways we can do this. One is to use pools of prepared volumes. The idea here is that the admin, prior to giving storage over to Cinder to manage, would provision volumes at both the primary site and the secondary site and set up a copy relationship between them, either synchronous or asynchronous replication. These volumes would need to be provisioned at the storage level as thinly provisioned, or able to expand, because we don't know in advance what size will be requested. We would then give the Cinder driver for that particular storage access to this pool, and when a provisioning request comes in, the Cinder driver takes a pair out of the pool and essentially gives it to the application. From the higher-level Cinder perspective it is only handing out the volume at the primary site, but what it's actually doing is taking a pair and, inside that Cinder driver, managing that pool. Each time a provisioning request comes in, it hands out one of these pairs, and the primary side is what gets given up to the higher level.

A second way we can do this is by externally creating the volumes on demand at both sites and then externally defining the replication relationship at the storage level. This requires an external communication pipe between the admin at the primary site and the admin at the secondary site, who could obviously be the same person. What would happen is that OpenStack is invoked independently at both sites: a volume is provisioned at the primary site and at the secondary site via Cinder, using the normal OpenStack provisioning mechanisms, and then outside of OpenStack those two volumes get tied together and we externally manage that replication; in the event of a disaster, all of that gets handled outside of OpenStack. So we would provision at the primary site, tell the admin at the secondary site to go ahead and provision at the secondary site, and tie those together, and we would do this for each volume we wanted to provision.

Now, even more than with the generic approach, where I hand-waved and skipped a lot of the details, there are even more details being skipped here, and the details end up being to some degree dependent on the particular storage implementation sitting behind that Cinder driver. There are also a lot of details in everything beyond simply getting the persistent data from the production site to the secondary site. We would probably want to place the OpenStack databases on replicated storage if we want to ensure the configurations are replicated, which could be realistic if we're using storage-level replication. But in all of these things, the devil is in the details.
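The pool-of-paired-volumes idea amounts to a bit of bookkeeping inside the Cinder driver; a hypothetical sketch of that bookkeeping (not any real driver's code) might look like this:

```python
# Hypothetical sketch of a Cinder-driver-style pool of pre-paired replicated volumes.
# The admin has already created the pairs on the storage and set up (a)synchronous
# replication between them; the driver hands out the primary side of a pair per request.
class ReplicatedVolumePool:
    def __init__(self, pairs):
        # pairs: list of (primary_backend_volume, secondary_backend_volume)
        self.free_pairs = list(pairs)
        self.in_use = {}            # cinder volume name -> (primary, secondary)

    def create_volume(self, name):
        if not self.free_pairs:
            raise RuntimeError("replicated pool exhausted; admin must pre-create more pairs")
        pair = self.free_pairs.pop()
        self.in_use[name] = pair
        return pair[0]              # only the primary-site volume is exposed to Cinder

    def delete_volume(self, name):
        self.free_pairs.append(self.in_use.pop(name))

    def secondary_of(self, name):
        # Used at failover time, outside the normal Cinder flow, to find the replica.
        return self.in_use[name][1]


pool = ReplicatedVolumePool([("siteA-lun-01", "siteB-lun-01"),
                             ("siteA-lun-02", "siteB-lun-02")])
primary = pool.create_volume("db-data")
print(primary, "->", pool.secondary_of("db-data"))   # e.g. siteA-lun-02 -> siteB-lun-02
```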
Okay, so let's go through a quick comparison of the approaches as they stand today. The generic approach is weak in terms of recovery point objective, recovery time objective and consistency; in fact, the consistency problem could be large enough in many use cases to make it actually unusable, since most workloads can't recover from an inconsistent state. Its complexity is moderate; the reason I rate it moderate is that there's no automation or scripting for it today, but I believe that could be overcome. And by definition it's very good at being generic: it makes no assumptions other than those of OpenStack itself.

The application-specific approach can have a really good recovery point objective, recovery time objective and consistency; that's what it's built for. Its complexity is again moderate, and where the complexity actually lies will depend on the details of the specific application and how it manages replication. But by definition it's not generic: it only works with that specific application, and you have to do something different for every type of application.

Finally, the storage-based approaches have good RPO, RTO and consistency. These are the approaches and mechanisms used by banks, for instance, to replicate financial transactions, so they cover no loss of data, getting the application working again quickly, and being consistent. But they can be fairly complicated to manage, and there is no way to integrate them into OpenStack today; what I showed before, either pools of paired volumes or doing it outside of OpenStack, is not for the faint of heart and would not be easy to put together. The approaches are moderately generic in the sense that, on a given storage implementation, any application will work; once you've done it for one application you've done it for all applications, because they work at the storage level, not the application level.

Okay, so where I think we'd like to get to is probably improving the generic approach in terms of its RPO, its RTO and its consistency. If we need to stay completely generic and not take advantage of any knowledge of the storage or of the application, we're probably never going to get to pure green in those areas. We probably need to reduce the complexity of all the solutions, and we're never going to make the application-specific or storage-based approaches more generic.

So, some thoughts on what we might want from OpenStack, and I think we'll have some time, so I'd also like to hear what others in the audience think we might want to do. First, I think consistency groups could be very useful to have in OpenStack: the ability to define a set of volumes of which I want to make a consistent snapshot. These volumes can be attached to one VM or to multiple VMs, but I'd like to be able to snapshot them in such a way that I have power-fail consistency across that collection of snapshots. Second, incremental copy of snapshots to Swift. If I'm backing up data into Swift, looking at that generic approach, and I have the ability to copy the data incrementally, not copying an entire snapshot and backing up an entire volume but just what changed since the last time, perhaps as a delta object, that can greatly improve my recovery point objective, because then I'm able to do this snapshotting and backing up into Swift much more frequently. The flip side of that is incremental restore of snapshots from Swift: if at my secondary site I've recovered a base version of the volume from Swift into a volume managed by Cinder, and I then have an incremental update, I'd like to be able to retrieve from Swift and apply only that incremental update. Basically, just take what's changed, back it up, then take what's changed and restore it, as opposed to doing a complete backup and a complete restore.

In terms of complexity, I think we need some automation of setup, recovery and testing for a DR solution; we don't want everyone who's building a DR solution to start from scratch. And I think we need the ability, through a standard mechanism, to integrate with storage-level replication functions and consistency groups, so that if the storage system, whatever it may be, supports these features, they can be driven by OpenStack in a way that integrates with Nova and with Quantum. Finally, I think we need a generic mechanism to support replicating the project databases and the Glance images, independent of the mechanism for replicating the persistent state that is modified by the application.
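To make the incremental copy to Swift concrete, here is a rough sketch using python-swiftclient. This is a proposal-style illustration rather than an existing Cinder feature, and the chunk size, container name, credentials and the naive hash-based diff are all placeholder choices.

```python
# Hypothetical sketch of incremental snapshot copy to Swift: ship only the chunks that
# changed since the previous snapshot as small delta objects, plus a manifest.
import hashlib
import json
import swiftclient

AUTH_URL, USER, KEY = "http://swift-proxy:8080/auth/v1.0", "dr:admin", "secret"  # placeholders
CHUNK = 4 * 1024 * 1024   # 4 MiB chunks, an arbitrary illustrative choice


def chunks(path):
    """Yield (index, data) chunks of a snapshot's raw image file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                return
            yield index, data
            index += 1


def incremental_backup(conn, container, name, snapshot_path, prev_digests=None):
    """Upload only the chunks whose content changed since the previous snapshot."""
    prev_digests = prev_digests or {}
    digests = {}
    for index, data in chunks(snapshot_path):
        digest = hashlib.sha1(data).hexdigest()
        digests[index] = digest
        if prev_digests.get(index) != digest:      # new or changed chunk -> ship it
            conn.put_object(container, "%s/chunk-%06d" % (name, index), contents=data)
    # The manifest records which chunk versions make up this point in time, so an
    # incremental restore can fetch a base backup and apply only the newer chunks.
    conn.put_object(container, "%s/manifest" % name, contents=json.dumps(digests))
    return digests


conn = swiftclient.client.Connection(authurl=AUTH_URL, user=USER, key=KEY)
base = incremental_backup(conn, "dr-backups", "vol1-0", "/path/to/snap-monday.img")
incremental_backup(conn, "dr-backups", "vol1-1", "/path/to/snap-tuesday.img", prev_digests=base)
```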
So with that, I'd be happy to take questions, and I'd be very happy to hear if people have views or opinions on other ways of doing things or what else they might want to see. And just so you know, I can barely see anyone.

[Audience question, partially audible: would it help if OpenStack itself replicated or mirrored the volumes?] It could help in some ways. If you're replicating all the writes going down to the volume, that is underneath OpenStack today, because the hypervisor goes directly to the storage; it doesn't go through OpenStack on the persistent writes. You would need to make sure that your primary and your secondary sites are close enough that you could do that synchronously, as part of the operation done by the host. But such a mechanism definitely helps.

[Audience question about Cinder backup.] Cinder's backup doesn't support that today. This is not so much a statement about Swift as about how you put the whole flow together: if Cinder did an incremental backup, then doing the incremental restore would be easier. And Cinder's backup mechanism today is written such that it can work with multiple back ends, so it would be nice to have a back end other than Swift for the backup.

[Audience: can you elaborate on what you need at the two sites?] To make the problem tractable you need a certain degree of similarity of the infrastructure, at least the part of the infrastructure you want to support. You need to be using the same hypervisor at both sites. You don't necessarily need to be using the same storage architecture; for instance, if you go with the generic approach, part of its value is that you could use heterogeneous storage at the two sites. But I don't see it as being tractable to use different hypervisors at the two sites.

So I think what you're addressing is how we could do the integration with storage-level replication by having an abstraction through Cinder, where it would be able to provision a volume which could be replicated, and then Cinder would know that this volume is replicated. I think ideally this should be done in such a way, though, that it wouldn't require only a storage array; it should work well with storage arrays, but if there were a software-based implementation it could work with that as well. Okay, I'd be happy to discuss this more with you afterwards. Other questions?

[Audience question about replicating images.] Yes, I pointed out that I was really focusing on the persistent storage attached to the applications, but Glance needs to be replicated, or have its persistent data stored in some third location, so that both the primary site and the secondary site have access to those boot images. Now, I don't think you need to do a replication of the boot image itself; in general I don't think you need to replicate the boot image. That's a good point. I think one of the things that needs to be done is to look at how one does DR for these applications without OpenStack, and not reinvent the wheel.
I think in many cases these applications, if they're going to be doing DR, aren't going to be booting from a local device, so the issue may then be having the ability to boot from non-local devices, which is there, and then treating those non-local devices the same way you treat the rest of the persistent storage. The support, I believe, is there but insufficient. Other questions?

[Audience question about the scope of the problem.] Yes, in this presentation I was specifically looking at what happens when a data center goes belly up; not a server failing, a device failing, or some piece of software failing, but the entire data center, and focusing on the situation where geographic dispersion is inherently necessary in order to address the problem. That is correct, but mechanisms for salvaging that kind of local failure don't necessarily work for DR. Amazon leaves a lot of this to you, but it also says: if you want to do this, take your EBS volumes and back them up to S3, because the EBS volumes don't have DR and S3 does. The generic approach I talked about was basically saying take the Cinder volumes, which are akin to EBS, and copy them over to Swift, which is akin to S3, so that generic approach is very much aligned with what you just described. The other thing I said was that it would be really nice to have some automation, some scripts, so it isn't all done manually. Amazon leaves an awful lot to you, and a large community has developed around tools and utilities that you can buy to automate this stuff, and there is value in that; it's not a completely absurd approach. Other questions? Thank you.