Good afternoon everyone. So welcome to our session, When Disaster Strikes the Cloud. I'm Michael Factor. I'm from IBM. I'm presenting with Sean Cohen from Red Hat and Ronan Katz from IBM. So here's what we're gonna cover in today's talk, and we're gonna split it: I'm gonna take the first third, Ronan the second, and Sean the last. We're gonna start with a little bit of level setting: what is disaster recovery, some concepts and basics. Then we're gonna spend the bulk of the time going into some detail about how we protect data and applications from disasters, talking about what's in Cinder today, stuff that just went into the most recent release as well as stuff from the Icehouse release, as well as looking at how one protects applications and not just the data, because to recover from a disaster you need to be able to get your workloads running again, not just have your data sitting someplace else. And then finally we'll spend some time talking about what comes next, what still needs to be done, looking at Kilo and beyond. So let's start off with some level setting. What is disaster recovery? Well, start with a definition. According to Wikipedia, disaster recovery is the process, policies, and procedures for recovery of technology infrastructure after a natural or human-induced disaster. In other words, disaster recovery is what is needed to get your IT up and running again and functional, so that your business can be doing something useful after a disaster. Well, what's IT infrastructure? What are we interested in? It's the servers, obviously. It's the storage, the network, the software, the configuration. If you look at this, it contains both data and metadata. Data, for instance, is what's stored in storage, and may be managed by Cinder; it can be things like a database or a set of log files. Metadata is things like the configuration information: what VMs are running, how your network's configured, and so on.
What kinds of disasters are we interested in? Well, when we say disaster, we don't mean a disk crashed or a server failed. We're looking at disasters which take out a data center, right? Things like floods, fires, terrorism, maybe even volcanoes. And when you have a disaster that takes out a data center, that means you need geographic dispersion in order to be able to survive that disaster. So we talk about a primary, right? That's the data center where your application and your workload are running. And we talk about a secondary, which is where you're going to recover after that disaster. And in order to have the ability to survive a disaster, your data and metadata need to be available in a location that's either the secondary or some third location other than the primary. Because our assumption is that after a disaster, the primary data center is gone, so anything that was stored at that primary data center is unavailable. So two key concepts in disaster recovery are recovery point objective, RPO, and recovery time objective, RTO. Recovery point objective basically says: how much did I lose? How far back in time does a disaster take me? The less data you want to lose, the more expensive it typically is. And it depends on your application, right? If I'm working with my bank and making a deposit and then my bank has a disaster, I don't want them to lose the record of that deposit. Some applications need to have basically a zero recovery point objective: no data can be lost. Other applications are more forgiving; you may be able to lose seconds or minutes, hours, or even days and weeks of data.
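To make the RPO idea concrete, here is a minimal sketch, with an invented timeline, of the loss window as the gap between the last replicated point and the moment of the disaster:

```python
from datetime import datetime, timedelta

def data_loss_window(last_replication: datetime, disaster_time: datetime) -> timedelta:
    """The RPO achieved is the time between the last successfully
    replicated point and the moment of the disaster: every write in
    that window is lost."""
    return disaster_time - last_replication

# Hypothetical timeline: replication last completed at 12:00,
# the data center floods at 12:07, so seven minutes of writes are gone.
loss = data_loss_window(datetime(2014, 11, 5, 12, 0),
                        datetime(2014, 11, 5, 12, 7))
print(loss)  # 0:07:00
```

A zero-RPO application would need synchronous replication, where this window is always empty.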
The less data you want to lose, so if you're sort of in this hours-to-seconds time frame, you're going to be doing some form of replication, in which as modifications come in, either to the data or the metadata, they are replicated to a remote site. If you're more in the days or weeks range, and this may be what's done sometimes for some development environments, maybe an offsite backup is good enough, which after the disaster you would restore. Recovery time objective is how long it takes to get up and running after a disaster. Again, it's more expensive the lower you go, and where you need to be depends on what your application is. Banks really don't want to be down, right? If you're supporting development, you can probably afford to be down for a little bit of time in the event of a disaster. The lower the RTO, the recovery time objective, the more active you need to be; you may need to have an active site with applications running in VMs already provisioned. The greater the RTO, the more you can do after the disaster. So, one of the key things that I mentioned already is that to recover from a disaster, you need the data and the metadata available in some place other than the primary data center. Remember, after the disaster, that primary data center and anything in it may not exist. But the data and the metadata need not just be available, they also need to be available in a consistent fashion. If the data or the metadata are either inconsistent themselves or inconsistent with one another, we essentially have garbage. And informally, we can define consistency for the data as: if a particular datum is available, all the data that it depends upon is also available. So, a concrete example: if I transfer money from my bank account to my son's bank account, right?
If the fact that it arrived in my son's bank account is available, the fact that it went out of my bank account also needs to be available. And for metadata consistency, we need to ensure that configuration updates are seen in the same order relative to one another when we recover, as well as relative to the data updates. We don't want to see that we've transferred the data for a new storage volume when we haven't applied the configuration update that says that storage volume exists. So, if we look at OpenStack, we see we have a range of different types of metadata. For networking, Neutron, we have things like the virtual networks. For storage, we have what volumes have been provisioned from Cinder, what VMs they've been attached to, what their types are. For VMs, Nova, we have the flavors of the virtual machines, SSH keys, and so on and so on. And one of the important things when we think about disaster recovery for OpenStack is thinking not just about the data, which is being put into the persistent volumes managed by Cinder, but also thinking about the metadata. So, now Ronan is going to take us through how we protect data and applications from disasters, talking about some concrete work that's already been done in OpenStack. Thank you. So, first thing, data protection is part of Cinder if we have persistent data. Cinder has a backup and restore facility allowing you to back up your volumes. Basically what it allows you to do, and in this example I'm showing Swift, is a backup of your volume to a Swift cloud, and Swift has the nice feature that the object cluster can be replicated, so it could be safe from a disaster. And that's a nice facility you have in Cinder today. So, what happens after a disaster when I want my volume back from the backup? I just need to go and do a backup restore.
One little bit of a problem is that the other cloud, my backup cloud, is not aware of that backup, because it's not registered there. So one of the solutions that went into Icehouse is the equivalent of electronic tape shipping, which basically allows you to create a backup on one cloud and restore that backup on another cloud. And basically this goes through the basic mechanism of a backup export, which creates a backup reference to your object storage, and you can take that backup reference and import it into any cloud. Following that, what you can do is very simply go out and do a backup restore of your volume into the secondary cloud and use the data on the secondary cloud, providing you with a backup-restore style of disaster recovery. Cinder has more features for disaster recovery. In Juno, Cinder has initial support for volume replication. Basically it allows a Cinder back end to advertise that it can do replication, providing a replication capability to the Cinder scheduler, and volumes are created with a replication extra spec. To enable replication, what you do is go out and create a volume type with this extra spec, with replication set to true. When you do that and you have a back end supporting replication, the scheduler is going to create your volume on that back end, and the driver which supports replication will create the volume on both back ends. You basically have a replicated volume in Cinder. For Juno, the support for that is in the IBM Storwize driver, and in the Kilo release we have more drivers expected to be in Cinder. So how does that work? If you have a replicated volume and you want to switch to use the replica instead of the primary, what you need to do is basically a replication promote. Replication promote goes out and makes sure that your VMs will now be attaching the replicated volume, not the primary one.
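The "electronic tape shipping" flow above can be sketched in a few lines. This is a conceptual simulation, not the real Cinder backup record schema: two toy clouds share one object store, cloud A exports a portable backup reference, and cloud B imports it and restores from the same objects.

```python
import json

swift = {}  # shared object store: object name -> bytes

def backup_create(cloud, volume_id, data):
    # Store the volume data in object storage and register a backup record.
    obj = "backup/%s" % volume_id
    swift[obj] = data
    record = {"backup_id": "bkp-" + volume_id, "object": obj}
    cloud["backups"][record["backup_id"]] = record
    return record

def backup_export(cloud, backup_id):
    # The export is just a portable description of the backup.
    return json.dumps(cloud["backups"][backup_id])

def backup_import(cloud, exported):
    # Register a backup created elsewhere, so this cloud knows about it.
    record = json.loads(exported)
    cloud["backups"][record["backup_id"]] = record
    return record["backup_id"]

def backup_restore(cloud, backup_id):
    return swift[cloud["backups"][backup_id]["object"]]

primary = {"backups": {}}
secondary = {"backups": {}}

rec = backup_create(primary, "vol-1", b"database bytes")
ref = backup_export(primary, rec["backup_id"])      # ship this reference
backup_import(secondary, ref)                       # register it on cloud B
print(backup_restore(secondary, rec["backup_id"]))  # b'database bytes'
```

The key design point is that only the small reference has to be shipped; the bulk data already sits in replicated object storage.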
And if you want to reverse and go back to using the primary volume, the only thing you need to go out and do is a replication re-enable, which will restore the replication relationship, but in this case in the reverse direction, causing any updates that you're doing on your secondary copy to be propagated back to your primary copy. Then to complete that, you just do another promote command, and you're switching back between primary and secondary, allowing you to go back and forth between your primary and secondary volume. The other feature that went into Juno that allows for disaster recovery is consistency groups. Basically that's the ability in Juno to group volumes together so they will stay consistent. As Michael said before, if you're writing data and it's written on more than one volume, we want to make sure that any operation that is done on that group of volumes is done consistently, together. The base mechanism to use consistency groups in Cinder is the volume type; it's based on the volume type mechanism. And today what you can do is basically take a snapshot of consistent volumes, and what still needs to be done is to extend the mechanism that was put in place in Cinder to support backups and to support volume replication. But then again, disaster recovery is not just about the data. It's not enough just to capture your data and make your data safe. You need to orchestrate your application and run it on the backup cloud or on your backup data center. So even if you're taking care of your data, you still need to do work with your VM instances, your network definitions, your configuration, basically the metadata of the workload. So we're basically extending the concept of Cinder data protection into workload protection. A workload is basically all the resources, all the definitions you have in your environment, in order to run your application correctly. So what does OpenStack have for that?
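The promote / re-enable / promote cycle is essentially a small state machine. Here is a minimal sketch of that cycle; the class and method names are illustrative, not the Cinder driver API:

```python
class ReplicatedVolume:
    """Toy model of a replicated volume: one copy is active (the one
    VMs attach to), and replication flows from it to the other copy."""

    def __init__(self):
        self.active = "primary"
        self.replicating = True

    def promote(self):
        # Switch attachments to the other copy; the replication
        # relationship is broken until it is re-enabled.
        self.active = "secondary" if self.active == "primary" else "primary"
        self.replicating = False

    def re_enable(self):
        # Restore the relationship, now flowing from the newly active
        # copy back toward the old one.
        self.replicating = True

vol = ReplicatedVolume()
vol.promote()        # disaster: run from the replica
assert vol.active == "secondary" and not vol.replicating
vol.re_enable()      # sync changes back toward the old primary
vol.promote()        # fail back once the sites are in sync again
assert vol.active == "primary"
```

The point of re-enable before the second promote is that the old primary first catches up on everything written at the secondary.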
So in OpenStack, applications can be easily defined using Heat. We have this HOT template, the Heat template, that we can use to define applications; it groups all the pieces of an application together. But one problem: not all applications are template-based. Many people are using the OpenStack CLI directly and are not going through Heat stacks. Moreover, deployments tend to change over time. We never have a static thing; our deployment is always changing. We have more volumes, we are adding more machines. So basically what happens is that the template and the stack don't always stay consistent. They change over time. At least, stacks can change over time; the template, well, someone needs to go out and update that, and that's not always happening. And moreover, all these definitions are very cloud-specific. Everything in the template is going to work on this specific cloud. Unless we have a replica of that cloud with the exact definitions, exact hardware, exact settings, not everything is going to work on the other end. So we have some tools that, for example, can take an actual deployment in a cloud and create a template for it, kind of like reverse engineering what we actually have running on the cloud. There are two, Flame and Reheat, which we can find in open source. But again, this newly created template is just going to fit this one cloud that we're running on. It's not really going to fit the other cloud. So what I want to show as part of this is a short demo of how we can do disaster recovery for workloads, which we are shipping with IBM Cloud Manager as a technology preview. Basically it allows you to go out and look at your cloud. We're logging into an OpenStack cloud as an admin, and we can see that we have applications, in this case just a WordPress blog, a very simplified application for this example.
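The "reverse engineering" idea behind tools like Flame can be sketched as walking the deployed resources and emitting a HOT-like template for them. The resource listing below is a made-up, simplified inventory; the Heat resource type names are real:

```python
# Hypothetical inventory of what is actually deployed on the cloud.
deployed = [
    {"kind": "server", "name": "wordpress", "flavor": "m1.small"},
    {"kind": "volume", "name": "wp-data", "size_gb": 10},
]

def snapshot_to_template(resources):
    """Emit a minimal HOT-shaped dict describing the live deployment."""
    template = {"heat_template_version": "2013-05-23", "resources": {}}
    for r in resources:
        if r["kind"] == "server":
            template["resources"][r["name"]] = {
                "type": "OS::Nova::Server",
                "properties": {"flavor": r["flavor"]},
            }
        elif r["kind"] == "volume":
            template["resources"][r["name"]] = {
                "type": "OS::Cinder::Volume",
                "properties": {"size": r["size_gb"]},
            }
    return template

tpl = snapshot_to_template(deployed)
print(sorted(tpl["resources"]))  # ['wordpress', 'wp-data']
```

As the talk notes, the hard part is everything this sketch glosses over: flavors, networks, and image IDs in the generated template are specific to the source cloud and need mapping before the template is usable elsewhere.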
And what we'll show is that you can easily orchestrate the backup of this application using panels which we added to Horizon: just create a disaster recovery policy, name it, and then start to edit the disaster recovery policy to add resources to it. So in this example, we're going to go out and edit this policy, this WordPress protection policy, and we're going to add the resources that are associated with that policy, so they're going to be protected together. And in this case, when I'm talking about protection, what's going to happen is that they are going to be packaged and put into object storage. So here we are adding resources, in this case just a VM. And after that, what we can do is basically go out and trigger that protection policy. So what does that mean, to go out and trigger the policy? We're going to package all the resources in that policy: the virtual machine, the image that it's using, and if it had any volumes, the volumes are going to be backed up using the standard backup, and everything is going to be put in Swift object storage. And that package can actually be retrieved at any time into any cloud, not just the same cloud. So it could also be considered a backup-restore mechanism, restoring the actual workload to the same cloud, but it could also be restored to a different cloud entirely, allowing us to move our application from one cloud to another. So basically what we're seeing is the process of the protection, where we're actually taking a snapshot of our VM, uploading the data to Swift, and then we can see that at the end of the day, we have a point in time stored in our object storage. So the only thing we need to do following a disaster is just go out to our backup cloud, our secondary cloud, go to the restore workload panel, and we can see the list of points in time which we have backed up, and just go and select one of them.
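The protection-policy flow just described can be sketched as: serialize the policy's resources into one point-in-time package and drop it into object storage. All the names and the package layout here are illustrative, not the IBM Cloud Manager format:

```python
import json

object_store = {}  # stand-in for Swift: key -> serialized package

def trigger_policy(policy_name, resources, taken_at):
    """Package the policy's resources as one point in time."""
    package = {
        "policy": policy_name,
        "taken_at": taken_at,
        "resources": resources,   # VM definitions, image refs, volume backups...
    }
    key = "%s/%s" % (policy_name, taken_at)
    object_store[key] = json.dumps(package)
    return key

def list_points_in_time(policy_name):
    # What the restore panel would show on the secondary cloud.
    return sorted(k for k in object_store if k.startswith(policy_name + "/"))

key = trigger_policy("wordpress-protection",
                     [{"type": "vm", "name": "wordpress"}], taken_at=1000)
print(list_points_in_time("wordpress-protection"))  # ['wordpress-protection/1000']
```

Because the package is self-describing and lives in object storage, any cloud that can read the store can list the points in time and restore one, which is exactly why the same mechanism serves both same-cloud restore and cross-cloud recovery.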
In this case we have one, and we just click on restore, and the nice thing about this is that what happens now is basically we are deploying a Heat stack. So what we see is that the whole redeployment process on the recovery site is being done using Heat, and Heat is able to orchestrate the recovery of the application back into the cloud. We can now see that everything regarding that workload has been restored, and that includes images and SSH keys and network configuration, and we have this blog running on the other cloud. And following this, it's time to talk about the road ahead, and at the end of the lecture Sean will be taking down the names of those who want to do work in OpenStack, so good luck, Sean. Thank you, Ronan. All right, before we go and look at the road ahead: as you've seen, we're trying to solve a big problem, and this big problem is bigger than one service in OpenStack. The foundation work was already laid in terms of adding the replication piece, and that foundation, we believe, is the right way to go, right? Without being able to replicate your data outside of your current data center to somewhere else, you don't have backup, you don't have availability when it comes to disaster and failure. And when we talk about disaster recovery, the notion is more than one site, right? It's bad enough that we have to deal with application workloads and high availability within a single site, and as you know, that's a very big, complex problem that we're dealing with today in different layers of OpenStack. But looking at multi-site, I think that's an even bigger problem to solve. The good news is we've laid that foundation, and as you saw, we have initial work done by IBM with Storwize already supporting this. We're going to see many more vendors working on implementing that replication piece. So let's look at the road ahead, and just to give you some examples of other platforms and where we are. Does this work? It doesn't work.
All right, so let's start with the open-source software-defined storage that I know everybody knows about, which is Ceph. As you know, there are seven tiers of disaster recovery, right? And the first three or four basically deal with a hot standby approach. So where we are right now on the Ceph block side is that we have good exportability of snapshots, as well as the ability to export incremental snapshots. So in a sense, it's a good foundation for dealing with manual disaster recovery, meaning if I have a site going down for some reason, my data center got overheated or lost power, I can protect the workloads that run on Ceph with the block back end and even do incremental backups, right? So this piece is already there. The key part is actually the notion of doing it in more than one site. If I have more than one block back end, like in Ceph, how do I actually get things replicated between back ends, between sites? Because I need my workloads to be served at the other site. And some of the ongoing activities on the Ceph side, in block, are of course RBD mirroring; that's a big feature that's being worked on. Another one is actually making use of the new API for volume replication between RBD back ends, right? So that's another one that's coming up. And if we look at multi-site in the object world, there are a lot of things you can do today, as with Swift; you can actually have something very similar to what you do with Amazon S3 when it comes to the different options you have for disaster recovery, all under one namespace. This is all there. These are things you can do right now in your data center if you choose Ceph. And again, the big part is how do we synchronize between the data centers? How do we have a full backup or a partial backup that we can use, and deal as well with users, affinity, all that stuff.
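The incremental-snapshot idea mentioned here (in Ceph's case, exporting diffs between snapshots and replaying them at the secondary site) can be sketched with toy byte strings. This is a conceptual model of replaying diffs in order, not the actual RBD diff format:

```python
def apply_diff(image, diff):
    """Apply one incremental diff; diff maps byte offset -> changed bytes."""
    img = bytearray(image)
    for offset, data in diff.items():
        img[offset:offset + len(data)] = data
    return bytes(img)

base = b"AAAAAAAA"                  # full export, shipped once
diffs = [{2: b"BB"}, {6: b"CC"}]    # one small diff per incremental snapshot

secondary = base
for d in diffs:                      # replay in snapshot order
    secondary = apply_diff(secondary, d)
print(secondary)  # b'AABBAACC'
```

The win is bandwidth: after the one-time full export, only the changed extents cross the wire, but the diffs must be applied in order or the secondary image is garbage.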
So the good news is a lot of this work has already been done in OpenStack in the object world, and you can utilize it today. So if you're using Ceph today, looking at this would be the right approach, and of course if you need block, this is coming up. And then we look at what it means to actually create different policies. So we started to show an example of actually using object storage, right, to create your policies. OpenStack is all about choice. And if you look at what service you would like to provide up the stack to your tenants, we can actually provide different levels of policies. And just like today we have different back ends that can support different flavors, if you will, of replication, because not all replication is created equal, right? You have synchronous replication, asynchronous replication. We actually have different tiers just when it comes to that policy. You can do object storage, you can back up to Swift. What it means at the end of the day is that you are going to have more choice in OpenStack for backing up your workloads. And some of this, as I mentioned earlier, is active-to-cold-standby, which makes a lot of sense in a cloud operating model, right? Because not everyone has the luxury of setting up a full new site, full blown, with all your hardware there. The whole ability of the cloud is actually to spin up instances when you need them. Disaster recovery is one of those cases. So if I have a disaster, I need those workloads to actually be served from another place. This is where the new policies can come into play, right? And we're going to have more matching done between what can be offered at the bottom all the way up to the tenants. And again, this brings also a big question of how we actually back up from our traditional environments to cloud, to OpenStack, or even between OpenStack and OpenStack, right? Right now we're still trying to solve the problem within a single OpenStack deployment. This is where we are right now.
We're taking baby steps. But one day we'll be able to actually move our workloads from my private cloud OpenStack implementation maybe to Rackspace, right? Everything is there. All the APIs, everything is there. I can just do that. That works for me, because I don't have to spend big money just to make sure that my workloads keep working. In a sense, this is disaster recovery in a hybrid cloud model, and it can be done. So this is the vision. This is where we want to go. But we're taking, as I said, baby steps to get there. And as Ronan mentioned, we need orchestration, right? It's good that we can kick off replication or even change the direction of replication, but how are we going to orchestrate it all? You just saw that we were able to add consistency groups in Juno, which is a big step, right? Because today your workload doesn't ride on only one volume. It's multi-tier, right? You can have different volumes that the application is spread across. You have a database, you have logs. You need to actually make sure that everything you capture and want to reload on the DR side is consistent, right? Meaning, if I'm doing snapshots, those snapshots need to be taken at the same time for me to actually load that application on the target side. So in a sense, yes, first of all we have to fix things around Cinder, because we've got volume replication in place, we've got consistency groups in place, but they don't talk to each other at the moment in terms of API. So that's something obviously on Cinder's plate to deal with. But it's not all about Cinder. I need Nova. I need Neutron. We haven't even touched on the users. I have different tenant users on the DR side. I have a different network topology on the target side. How do I make those changes when I load the instances back? I need some orchestration tool, and this is where Heat can actually come into place and help. And when we look at some of the work already in the plan for the Kilo release for Heat...
So Heat is starting to deal with bigger problems, like stack snapshot and rollback: the ability to take a snapshot of a stack and be able to roll back to it. That's key. And we're going to see much more granular work being done in Heat to orchestrate larger, if you will, disaster recovery points. And again, we need to make all those changes all the way down to the guest. But it's not all about having a single consistent initiator that will go to the other side and make sure that I've laid down my new topology on the target side. What about application consistency? So another thing that is actually on the plate right now for the Kilo release, submitted by Hitachi, but long known in the KVM world with libvirt, is the ability to actually do an FS freeze, a file system freeze, for your workloads, right? If we look at the work that's been done, the code is already merged. So for those of you who don't know the QEMU guest agent: it's basically something you can load in a guest, and it's already supported out of the box in RHEL 7. What it does, if you look at the slide here, is it goes and quiesces the file system. So it actually gives you a window in which we can take that checkpoint, right? So whatever snapshot you're going to take, it's going to be file-system consistent, meaning if I reload it on the target side I can actually load it, right? It's not garbage. But that's not the end. This is just the start, because we also need to talk about application consistency. And the good news is the QEMU guest agent also deals with that. How many of you have used application-consistent backup with replication in your traditional virtualization? Raise your hand. How many with Linux, MySQL backup, Windows VMs, et cetera? So the QEMU guest agent actually has the notion of dealing with application consistency: it supports VSS for Microsoft Windows out of the box.
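The freeze / snapshot / thaw window described above is naturally expressed as a context manager: quiesce writes, take the checkpoint inside the window, and always thaw afterward. The guest-agent calls here are simulated, not actual qemu-ga invocations:

```python
from contextlib import contextmanager

class Guest:
    """Toy stand-in for a VM whose agent can freeze/thaw file systems."""
    def __init__(self):
        self.frozen = False
    def fs_freeze(self):   # models qemu-ga's guest-fsfreeze-freeze
        self.frozen = True
    def fs_thaw(self):     # models qemu-ga's guest-fsfreeze-thaw
        self.frozen = False

@contextmanager
def quiesced(guest):
    guest.fs_freeze()
    try:
        yield guest
    finally:
        guest.fs_thaw()    # thaw even if the snapshot fails

guest = Guest()
with quiesced(guest) as g:
    # Take the snapshot here: writes are quiesced, so it is
    # file-system consistent.
    snapshot_is_consistent = g.frozen
assert snapshot_is_consistent and not guest.frozen
```

The try/finally is the important part: a hung snapshot must never leave the guest's file systems frozen.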
So in RHEL 7 we actually install the VSS provider as part of the deployment when you install the guest tools. So if you're running any Windows workload application on top, it will go ahead and automatically take a consistent checkpoint when you create a live snapshot. In Linux, it actually opens the door to more hooks, right? It allows you not just to deal with the file system itself, but also to inject a script that can quiesce the database or any application. And most of the databases actually already support some level of hot backup mode, so it can easily be scripted. What does that mean? It means that the next step for us is to actually come up with solutions that make use of these new technologies that are coming on board. So by the time we do our site evacuation to the other site, we can actually have protected the data and the metadata, as Michael mentioned earlier, which is a big piece of it, right? Each one of the services in OpenStack has its own metadata, has its own databases. We need to actually capture the state, not just the data, and move it to the target site. And the next thing I want to discuss is actually an even bigger problem. Disaster recovery, as we mentioned, is a big problem to solve. We cannot do it in one release in OpenStack. We know that, but as you see, there are a lot of building blocks taking place that are actually getting us where we want to go. But when we look at disaster recovery at scale, it's an even more complex problem to solve, because we need to actually deal with maybe a full data center, and then prioritize that work and basically get actual triggers, right? Today, a lot of the customers that I'm working with are basically in the first phase, right? They're doing export, they're doing import, totally manual work. That's not the whole way. We want automation. OpenStack is all about automation.
We need to know, first of all, when a failure hits us badly, so we can actually trigger the policies we discussed earlier, right? As you saw from the demo, we had to push a button to do the failover. Is that good enough? I'm not sure. If you need to give 99.99999% service to your cloud tenants, that's not a solution. It's a good way to wake up the admin in the middle of the night, right? Just to push the button. So where we want to go is an automated procedure, and a lot of the work is being done. First of all, the OpenStack HA approaches in Juno actually started to look at fencing, using Pacemaker for example, to actually trigger that, and they provide the monitoring piece as well for the infrastructure, to actually detect that failure and take action. So in a sense, if you tie that to what I said earlier about the different policies and service tiers for disaster recovery: not all workloads are born the same. Not all of them are mission critical, right? Maybe 30% are mission critical. I want to be able to set for that 30% a policy so that whenever something bad happens, and it's not just a high availability event, it's something that happened at the site level, it can automatically trigger all the automation pieces and be able to spin things up in the other cloud, and believe it or not, my tenant users would not even know that something bad happened. This is where we want to go. We're not there yet, but at least we have the vision to get there and keep going down that line. The last point, regarding work that was done in Juno in Nova, is simple tagging. Simple tagging actually allows you to create a new tag for automation, in this case automatic recovery. What it means is we can actually use the Nova client to go and monitor and look for that tag, so whenever it shows up, we can actually pull the trigger to do the failover. As you can see, progress on disaster recovery is taking place in so many places today in OpenStack.
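The tagging idea above can be sketched as filtering instances by a recovery tag when a site-level failure is detected, so only the mission-critical subset triggers failover. The tag name and data shapes here are illustrative, not the Nova API:

```python
# Hypothetical instance inventory; only tagged workloads get automated
# recovery, the rest wait for manual action.
instances = [
    {"name": "billing-db", "tags": {"automatic_recovery"}},
    {"name": "dev-scratch", "tags": set()},
]

def instances_to_recover(instances):
    return [i["name"] for i in instances if "automatic_recovery" in i["tags"]]

def on_site_failure(instances, failover):
    # Called by the monitoring piece once a site-level failure is
    # confirmed; failover() is whatever spins the workload up elsewhere.
    for name in instances_to_recover(instances):
        failover(name)

recovered = []
on_site_failure(instances, recovered.append)
print(recovered)  # ['billing-db']
```

This matches the tiering point made earlier: the policy (the tag) decides per workload whether the expensive automated path fires at all.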
Cinder is the foundation; we're heavily dependent on Heat for orchestration, and as Ronan mentioned, we need other tools as well to do the reverse engineering, if you will, just to deal with what the other site needs. But we also need to solve the problem from the compute side, because this is where the triggering and monitoring piece is going to take place. A lot of work is still in front of us, but at least we have a good starting point. One thing that is really behind is this. How many of you have read this book already? Zero. You know why? It's not there yet. We're basically calling for people to pick up the glove and work with us in the OpenStack community to generate the book. As you all know, we have a good starting point with the high availability guide. I call it a baseline guide, but the fact that customers still come to us and ask how to get high availability set up means that we haven't solved this problem yet. So obviously we need to get this piece in place, because that's the foundation, but we also need to make sure we have a good guide in place to document what we have right now. If you look online, there's the recover-your-cloud-after-a-disaster section, which is currently just one page in the admin guide. Don't use it unless you really read it thoroughly, and again, it's not there. It's a year old and it's not reflecting anything that's being done right now in OpenStack. So clearly, as you saw, we've made a lot of progress in the last two cycles, and we continue to make progress, but we need to capture that. So at least for the first tiers of disaster recovery, the manual steps should be outlined, so everyone using OpenStack today can protect their workloads, even with basic object storage to Swift, and enjoy that fruit. So that's our next step: just to get whatever we've done right now out there, and of course as more work lands, we're going to update that guide as well. So we're basically calling for volunteers.
I think we've concluded our time, so it's a good time to call back my colleagues and take some time for a Q&A session. So who wants the first question, about anything you've seen today? All right, no questions. Yeah, go ahead. Do you want to use the mic? I had a question around, if tomorrow I have to implement this in a public cloud. So a public cloud vendor really wants to leverage this functionality. What's the deal? How can you do that? The entire disaster recovery. So it could be that the tenants are actually running on premise and they're using a public cloud as an extended data center, and they want to use the public cloud side as a recovery site. So I'm going to start and then let my colleagues continue. The good news, as I said, is there are a lot of manual steps that you can actually do today. I gave the example earlier of Ceph. With Ceph you can actually pipe one command line to do a lot of the work for you. So it pretty much depends what choices you're going to make for your back ends, and then look at what automation pieces you can solve today. I don't want to say it's not possible. It is possible, but it requires a lot of work just to get it automated to a point where you can actually deliver a service on it. The good news is you can actually protect your workloads today. It's not complete, as you saw; a lot of the consistency pieces are not there. But there's a good starting point. Just to add to that: basically, today, for a lot of things you need cloud provider support. For example, if you use Ceph for your back end, you need the provider to support backup for Ceph. But eventually, I think, as we go forward and we have more features in OpenStack, we will be able to use the public interfaces to do disaster recovery. So that's the goal. You can use the public interfaces and make disaster recovery happen. So you could do that on your private cloud, you could do that on your public cloud. Again, it's going to be your choice.
So, back up to a Swift on a public cloud, since you're asking about how to go from a private cloud to a public cloud for DR. The simplest way is simply to do a backup into Swift on one of the many providers who have Swift on public clouds. And then afterwards you bring up OpenStack on one of the various providers who are able to bring up OpenStack on public clouds, and you can restore. Are there more questions? My question is more in the direction of taking Heat into account. If you have Ceilometer, and there's an application, like the application we were just talking about, which is doing some counters on some tenants, etc. If you use Heat there, it's not going to work, right? What are the alternatives? So let me just repeat the question. We're basically starting to leverage Heat, but Heat is not solving all the problems. In a nutshell, the question mentioned telemetry. We're looking at telemetry as well, right, for some of the orchestration. The net is, we don't have a full notion of an orchestrator for disaster recovery out of the box. As you saw, this is a cross-project effort within OpenStack. And actually there's an etherpad going on right now for the cross-project work; there's a disaster recovery section there. A lot of vendors have started to communicate on that. So as a follow-up, I suggest capturing this as well as a requirement there. I can tell you that from some of the work and foundation work we have looked at from the Heat perspective, it does make sense to do things in Heat, but we need to set the foundation, meaning we need to invest and create new features and actually make it happen. It's not there yet. And I think it's time, we need to clear the room for the next session. So if you have questions, you know, the presentation will be on the OpenStack website. So you're invited to talk to us about it.