Hello, everyone. Welcome to the Open Data Foundation Office Hour. I am Michelle Dupama, Principal Solution Architect, and masquerading as host, and I'm here with Annette Clewett. We're going to talk about something that should be near and dear to the heart of anyone who deals with storage: disaster recovery. So this is Annette, my fellow Red Hatter and friend. Please introduce yourself, and let's get into disaster. All right. Thank you, Michelle. Yeah. Disaster, and how we're going to deal with it with OpenShift Advanced Cluster Management and ODF, OpenShift Data Foundation. So I'm going to kick off with some slides here just so we can set the stage, and then later I'll do a demo to show you that not only is this stuff all real, it's all available in shipping product. So that's always a good thing. Let me go ahead and share my screen first, that'd be a good idea, and then I'll get the slides going. Okay. Hopefully everyone can see this. Yep. Just let me know that you see it. Okay. All right. So what we're going to do is talk about the ways that you can get resiliency in really any situation, but we're going to use the containerized environment in OpenShift as our example. So if we start at the top there: one thing that people have done in all kinds of environments for a long time is to create backups on some regular basis, and then, if needed, restore those backups. It could be snapshot-based, and if we relate it to OpenShift and ODF, it definitely is snapshot-based. You have had the ability in OpenShift for a few releases now to create snapshots. And if you're using ODF, that initiates the Container Storage Interface driver to create the snapshot and put it into the storage cluster. Then, when you need to recover to a known good point, you can use that snapshot to recover a persistent volume and mount it back in your application. So that can all be orchestrated.
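The snapshot-and-restore flow described here can be sketched as two custom resources: a CSI VolumeSnapshot of an application PVC, and a new PVC restored from that snapshot. The namespace, PVC names, and sizes below are illustrative; the snapshot class and storage class shown are the usual ODF RBD defaults, but treat them as assumptions and check your own cluster.

```yaml
# Take a CSI snapshot of an application PVC (names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: busybox-pvc-snap
  namespace: busybox-sample
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: busybox-pvc
---
# Recover to a known good point by creating a new PVC from the snapshot,
# then mount it back into the application
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: busybox-pvc-restore
  namespace: busybox-sample
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  dataSource:
    name: busybox-pvc-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```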
We'll look a little bit later at how we can orchestrate that. But that is a basic capability, and it doesn't go away with any of the disaster recovery solutions. The next level, and the one we're going to focus on today, is what we call regional disaster recovery. This is meant to protect you from an entire area outage. If anyone in the US remembers Hurricane Sandy, which hit the New Jersey and New York area, I think it was about six, seven years ago. I was working for a company that was in New York at that time, and I can tell you that people figured out quite quickly that they had too many resources, data centers, and colos located in that area. So with regional DR, you want to separate your disaster recovery capability enough that you're not susceptible to an area outage like Hurricane Sandy. So regional, in this case, are we talking about different continents, different coastlines? How far away do we have to get? Yeah, in reality, what I have seen, and again, this is not specific to containerized platforms, is that regional would be within the continental US, for example. You wouldn't usually put one site in, say, India and one site in New York, right? There are laws about data and how you can hold data within a particular country. Something like EMEA, the entire EMEA area, for the most part doesn't have a lot of distance within it. But it's usually either regional based on country boundaries, or at least regional based on knowing that I can ship data and not have any regulatory issues. So the asynchronous part of it means that because we have that distance between the sites, to protect against an entire area outage, we basically can't do a write, wait to get the write acked, and still have good I/O performance.
So what we do is we essentially ship the data, and we don't require that it's acked before the client continues. That's asynchronous, which means I get the data to the other side, but if there's an outage, I could lose some data in flight, or, if it's snapshot-based, I could lose some data because I don't have the latest snapshot. If we contrast that with the bottom, that's sort of the best of both worlds, except it is latency-limited in terms of the distance between the locations, to keep it synchronous and keep the I/O performance where it needs to be. Synchronous means I do a write, I wait for all of the replicas, and with ODF we know the default is three replicas, I wait for all the replicas to be written, and then I ack back to the client. At that point the client continues with the next I/O. Because of that, we have to keep the distance limited, and that would be Metro DR. I'll go into each one of these a little bit more. Okay. So the solutions that we bring to this from a Red Hat perspective, the downstream products, are Red Hat Advanced Cluster Management, and, as we'll see later, some components that are integrated in order to do disaster recovery. You'll see in the demo that this is pretty easy to use. There is some upfront configuration you have to do with all these components, but it's pretty easy now to do a failover on a per-application basis. The operators we're going to use we're calling ODR, OpenShift Disaster Recovery: the ODR hub operator will be on what we call the hub cluster, and the ODR cluster operator will be on the managed clusters. I had a question from the previous slide, and I see it's mentioned here: what is RTO? RTO is recovery time objective. These are standard terms in disaster recovery, or business continuity.
What it means is it's your objective for when you're actually recovered, meaning that your application is able to continue to take I/Os and continue to operate, whatever it's doing. Recovery point objective is your tolerance for data loss. Like I said, in the asynchronous situation, I could be doing snapshots every five minutes, and if I had an outage, my last snapshot could be used, but it might be missing up to five minutes of data. That would be my recovery point objective: how much data am I willing to lose? It's basically a design goal, right? We have a question in the chat. Do you want to take questions now? Yeah, let's just go for it. Okay, so from the chat we have: does this mean that ODF is becoming an alternative for Kasten or Trilio backups? Uh, I'm not sure what that is. So, I think something that should be clear is that we're not prescriptive about the actual backup mechanism, correct? If I heard you correctly, did you say Kasten? Kasten, thank you. Yes. Yeah. ODF is not a backup utility. ODF itself, OpenShift Data Foundation, uses the Container Storage Interface, and it has the features of snapshot and clone. It integrates with third-party products, Commvault, NetBackup, Spectrum Protect Plus, Trilio, Kasten K10, to provide enterprise-level backup. And we also have our own solution called OpenShift APIs for Data Protection. We'll get to some of that, Michelle. But if there are any other questions, let me know. That particular backup, I'm not familiar with. Okay. So, where were we. Just finishing out here, combining these solutions: we have the middle, which is the OpenShift Disaster Recovery operators, and on the bottom, we have OpenShift Data Foundation.
Now, what OpenShift Data Foundation is going to supply in this particular area is asynchronous mirroring. So the mirroring, the replication capability: for each volume, on a per-volume basis, we're going to replicate, say every five minutes, the delta contents, not the entire image but the delta, from one cluster to the other, to the alternate cluster, which is essentially the disaster recovery target for that application. This slide is a lot of data, but these are the three things we've added in here. Cluster HA, which I'll just go over quickly, gives you high availability at the single-cluster level, and you can see it has RPO equals zero, RTO equals zero. Just be aware that this is out of the box, really, for OpenShift, or OpenShift plus ODF, if you're installing on AWS or Azure or GCP, or places that have AZs. OpenShift is able to use that environment and schedule itself so that it has high availability. But the availability zones need to be quite close to each other, as they are in all those cloud environments I mentioned. As noted there, right now vSphere and bare metal do not have topology labels out of the box when you deploy OpenShift, but that is coming, so that you can basically create AZs for a single cluster. And then we already went over regional DR a bit. One thing between Metro DR and regional DR: there are really no latency limits between locations for regional DR. There could be some implications for replication time, but with Metro DR there definitely are. So if we look at the backup we talked about, this is your first approach to trying to survive or recover from an outage. There are lots of solutions here. ODF is again supplying the snapshot and clone capability; there are other CSI drivers that also do the same thing. OADP has been a community operator available in OperatorHub for the last two years, I believe.
It is now very soon going to be a Red Hat supported operator. It basically replaces something like a Trilio or a Kasten K10 in terms of the ability to back up not only the snapshots for persistent data, but also the Kubernetes resources. What it's really doing is making the Velero APIs available to you within OpenShift. If you're familiar with Velero, that would be backup, restore, schedule. It also has the Velero CSI plug-in. So all of that. You can use OADP, and OADP will get you a good way down the road, but if you're really doing enterprise-level backups, you'd probably want something more, something that has a UI and more capability. Okay, so on that point, you just mentioned Trilio. Is that what it is that you were just talking about? Yeah. So this would replace them. Okay, I was just looking at the chat. Yeah. So about a year ago, I and a couple of our colleagues, Michelle, did a lot of work with Trilio, Spectrum Protect Plus, and Kasten, and we not only did testing, we also created solution guides. The main thing these products have to do in terms of interfacing with OpenShift and ODF is, with OpenShift, they have to be able to back up the Kubernetes resources. Most of them do it at a namespace level. And when you recreate the application, either on the same cluster or an alternate cluster, you have to be able to restore the Kubernetes resources: the pods, the routes, the services. If you can't do that, I don't really care if you restore the persistent data. Even the persistent data has to be restored as Kubernetes resources, those being PVCs and PVs. So they all have to work together, and OADP is an alternative, using the upstream Velero APIs, to do what Trilio and Kasten and Commvault do, by using the CSI driver capability of snapshot and clone. If you just don't want to have to get a third party, you could start with OADP and see if that works for you.
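The Velero APIs that OADP exposes, backup and restore of a namespace's Kubernetes resources plus CSI snapshots of its PVCs, can be sketched as below. The namespaces and names are illustrative; in particular, the namespace where the Velero objects live depends on how OADP was installed.

```yaml
# Velero Backup of one namespace, including snapshots of its volumes
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: busybox-backup
  namespace: oadp-operator   # assumed OADP/Velero install namespace
spec:
  includedNamespaces:
    - busybox-sample
  snapshotVolumes: true
  ttl: 720h0m0s              # keep the backup for 30 days
---
# Velero Restore, which recreates the pods, routes, services, PVCs, PVs
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: busybox-restore
  namespace: oadp-operator
spec:
  backupName: busybox-backup
```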
And OADP, Velero, is involved in migration as well, correct? I think I remember. It can be. It's sort of the same thing, right? Because when you use OADP, whether you're restoring to the same cluster or a different cluster, you store the data in an S3 bucket. In migration, it's the Migration Toolkit, which is also an operator within OperatorHub for OpenShift and has been for like four years now, for migrating from, say, OpenShift 3 to OpenShift 4. It does a similar thing, but it does some amount of translation between Kubernetes resources, because the OpenShift 3 Kubernetes resources are not the same as the OpenShift 4 Kubernetes resources. There can be some changes, right? So there's some amount of transformation, and most of these products offer transformation, which means, say, I start using one storage class for a volume on cluster one, and when I get to cluster two, I transform and use a different storage class. That's exactly how you do migration between, say, OpenShift 3 and OpenShift 4. Okay. Great. I'm not going to spend a lot of time here, because we want to get to regional DR, but we can make these slides available. This is basically what I just went through, and the bottom part there is the important part for ODF: we provide the snapshots via CSI that drive the capability for OADP or other third-party vendors to snapshot and recover persistent data. Just getting a little bit more into out of the box: when I say out of the box, I mean you don't have to do any configuration if you've got availability zones. You're deploying on AWS; OpenShift and ODF are going to schedule evenly across the availability zones, so you'll get protection from a single availability zone outage, and everything should still work. Now, not all applications that people deploy are topology-aware.
So you do have to make sure, if you want to survive an AZ outage, that your application is topology-aware, meaning you need replicas of your application, and you need to make sure that if an AZ goes down, the application is either already running on the available AZs or can be scheduled on an available AZ. We've been doing this for a long time, and the main thing here is that there are latency limitations, because you've got etcd, which has to have quorum and has latency requirements, and then you have the storage. In reality, you're probably looking at something around one to two milliseconds, which is not a lot of distance between these AZs in most of the clouds. So you have protection, but you don't really have area protection. So we go to Regional DR, which we're going to demo in just a couple minutes. Regional DR is the one that I in particular have been focused on, and that we just released at Red Hat late last year, meaning the components. The components, based on the slide we had a while back, are Advanced Cluster Management, which runs on a hub cluster, so it runs on its own OpenShift cluster. In this particular solution, you would not have it in the same region as region one or region two, because this solution currently requires that your hub cluster stay online. For the other two clusters, this solution is meant to protect against a complete cluster outage. So what we're doing is using ACM, deploying the application via ACM, and then replicating the persistent data via the ODF mirroring from one cluster to the other cluster, on a snapshot basis. So, like I said, every five minutes the delta data from each volume that is being mirrored is copied over to the alternate cluster. We're also keeping track, via GitOps or something like that, of the changes in resources.
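The five-minute mirroring interval between the two managed clusters is expressed in a DRPolicy custom resource on the hub. This is a sketch based on the ODF Regional DR setup; the cluster names and S3 profile names are assumptions, and the exact field layout may differ by release, so treat it as illustrative.

```yaml
# DRPolicy pairing the two managed clusters, replicating deltas every 5m
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: odr-policy-5m
spec:
  schedulingInterval: 5m          # snapshot/replication cadence per volume
  drClusterSet:
    - name: ocp4perf1             # assumed managed cluster names
      s3ProfileName: s3-primary   # assumed S3 profiles for metadata
    - name: ocp4perf2
      s3ProfileName: s3-secondary
```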
So whatever way ACM can install an application, that's how we're going to reinstall the application, if needed, on the passive side. Now, active-passive only applies on a per-application basis. I could have on cluster one an application that is essentially passive, and have it active on cluster two. In that case, the asynchronous volume replication arrow would go in the opposite direction. Okay. I don't know if that makes sense, Michelle, but there's not an active cluster and a passive cluster. There are active instances of the application, and on what we call the passive side, the application is not even running until you have a failover. We'll see that. So you're not using resources on the passive side until the application needs to fail over. Okay. So active and passive is determined at the application level, and you could decide that you want to split them up for whatever reason. Exactly. Yeah. For example, if you didn't know what cluster or what region could go down, you might have critical applications and, sort of like a blue-green deployment, you'd have some in one cluster and some in another cluster, each being active or passive. Great. We talked about RPO. With the automation now, with ACM and ODF and ODR, OpenShift Disaster Recovery, this gets down into the minutes for both RPO and RTO, which is pretty low when you compare it to, say, a backup-and-recovery kind of strategy. The other one we talked about, and it's not available in the downstream products yet, is Metro DR multi-cluster. From the ACM point of view and the ODR point of view, it's very similar in terms of all the automation and capability. The difference is it's going to use an external storage plane, an external ODF cluster, and that cluster is stretched.
So it's just a single cluster, and it is synchronous. So we have latency requirements, and we have to have a third location to be the consensus or arbiter, sometimes called a witness: a third location where one of the storage cluster components has to live in order to establish quorum. Okay. So you would not even consider this unless stretching the external ODF cluster made sense. Correct. Yeah. Here we're talking maybe a couple hundred miles that these locations could be separated by. The arbiter node has less strict latency requirements; it could probably be up to 50, 60 milliseconds away, which, in terms of how far a packet travels, is basically the distance between, say, California and New York. But yeah, the data part of it would be limited. You do get the synchronous capability, which means your RPO is basically equal to zero. Your RTO would still be non-zero, because you do have to do the automated failover and reinstall the application on the alternate cluster. Okay. So I think we're done with that. If we could queue up the video. And do we have any questions that I haven't answered? No, not other than that one. Okay. So let's queue up the video, and you can start playing. I see the question now: is it an alternative to Kasten and Trilio, I think. Okay. So the answer is: ODF, again, is not an alternative. ODF just supplies the snapshot and clone, but OADP, OpenShift APIs for Data Protection, is an alternative that you could try out in OperatorHub. There's no additional subscription charge for that. But it doesn't have the bells and whistles of an actual enterprise backup product. Okay. Should I roll the video? Yep. Hopefully everyone can see this. Okay. Yeah. Thank you. Okay. So what we have is, just one question, I'm seeing our photos right at the bottom. Maybe we should just... Are they in the way?
Are they in the way? Okay. Can we just maybe stop the video? Sure. Yep. I paused it right there. If we can't change the location, maybe we could just mute our video. So... How's that? There we go. Perfect. Yeah. Can you stop the video and... Yep. Do you want me to go? It's paused at the moment. I can go back. Okay. All right. Let me just start up, then, and I'll tell you when to start it. Okay. Because the video is going to roll whether or not I do. Thank you. So what we have here, I'm just going to go through the flow, and then we'll actually show it working. Again, this is what we call regional disaster recovery with Advanced Cluster Management. We have two managed clusters. They have to be managed by Advanced Cluster Management; they could be imported, or they could be created on AWS, VMware, any of the platforms ACM supports. The main things they have are the OpenShift Disaster Recovery cluster operators and OpenShift Data Foundation, which is how you get the OpenShift Disaster Recovery operators. Currently, the application we're going to look at is installed on the primary cluster. On the ACM cluster, we have the OpenShift Disaster Recovery hub operator, and that's where the application was installed from. The managed clusters are also managed from there. So you can continue now. Okay. Rolling? Rolling. Okay. So what we're going to look at here is the flow. The OpenShift administrator would go ahead and initiate the failover. This is done using a custom resource on ACM. That will populate the metadata for the volume into the secondary cluster. We'll demote the storage; as soon as we demote the storage, the app is out. We promote it, and then we go ahead and reinstall the application on the secondary cluster. ACM does that. The I/Os are redirected, and the application on the other cluster is deleted. So within ACM, these clusters were actually imported. One is in AWS us-east-1, and one is in AWS us-west-2.
About 50, 60 milliseconds apart. What's important for this solution is that we have to connect them, so we use another ACM feature, cluster sets. Within cluster sets, we have an add-on called the Submariner add-on. This allows the cluster and service networks, which are private networks, to be connected so that we can do the mirroring, the replication between the clusters, for the volumes. This is all done via ACM and works quite nicely. So now let's look at the three OpenShift clusters. Cluster one is a managed cluster, and to get this set up, I installed OpenShift Data Foundation. That gets me the OpenShift DR cluster operator, and ACM installed the Submariner add-on. On cluster two, I have the same thing. What's important here, in terms of getting these, is that you need the OpenShift Data Foundation Advanced SKU with OpenShift Platform Plus; that's how you get the ODR operators. On cluster three, if we look at all projects, I installed Advanced Cluster Management, and I installed a new operator that sets up the mirroring, the ODF Multicluster Orchestrator. Again, the OpenShift DR hub operator is available when you get the Advanced SKU with ODF, and it's the one we're actually going to use; we're going to use a custom resource in there to do the failover and failback. As I said in the slides, we have to have installed the application via ACM, and ACM has lots of ways to install applications. This one was installed using GitOps. If we look at the application after it's been installed, the topology is available, and right now the application is running on cluster one. It's a simple application: every second it writes a 4KB file to the storage. In the terminal view, on the top we have cluster one, on the bottom I have cluster two. And if we look at the bottom and do an oc get pods,pvc, there's nothing in the namespace.
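The cluster set and Submariner wiring done through ACM might look roughly like the resources below. The cluster set name is illustrative, and the ManagedClusterAddOn lives in each managed cluster's namespace on the hub, so it is repeated once per cluster; treat the install namespace as an assumption.

```yaml
# Group the two managed clusters into a cluster set (name illustrative)
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: ManagedClusterSet
metadata:
  name: odr-clusterset
---
# Enable the Submariner add-on for one managed cluster; this connects the
# private cluster and service networks so volume mirroring can flow
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: submariner
  namespace: cluster1              # the managed cluster's namespace on the hub
spec:
  installNamespace: submariner-operator
```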
So the application, as I showed, is not running yet on cluster two; it is still running on cluster one. What we want to do now is go ahead and do a failover. We do that by using a new custom resource in the OpenShift DR hub operator. It is namespace-scoped, so you have to go to the namespace where you want to make the change; this is on a per-application basis. The name of it is DRPlacementControl. This is the new custom resource that allows us to just modify it and fail the application over from one cluster to the other. So we do that. Before I do, let me just show you that this application, in addition to writing to the storage, has a Grafana dashboard, and it shows us visually where the application is currently running. Its preferred cluster is cluster one; the green bar just means it's currently active there. And we can see on cluster two there is no data, because nothing is running there yet. We'll see a little more on that dashboard after we do the failover. So if we go back now to the DRPlacementControl, all we have to do is change the action in this custom resource to Failover. In terms of where it's failing over to, that's also in the custom resource: the failover cluster is cluster two, and that's something you put into the resource. Once you save that, the actions I showed in the flow chart start to happen, and we can watch that a couple of different ways. We can go back to the terminal, and if I put a watch on it, you can actually see that in 19 seconds it had already switched over. When we go back to our dashboard, we see a switch, and the 39 seconds, I want to just explain what that means. This dashboard is displaying essentially the amount of data that was lost. What that means is, when the application came up on cluster two, there was a snapshot that had just been done.
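The DRPlacementControl edit described here can be sketched as below. The cluster names, policy name, placement name, and label selector are illustrative stand-ins for whatever your deployment uses; changing spec.action is the one-line operation that triggers the failover.

```yaml
# DRPlacementControl in the application's namespace on the hub cluster
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-drpc
  namespace: busybox-sample
spec:
  preferredCluster: ocp4perf1     # where the app normally runs
  failoverCluster: ocp4perf2      # where a failover sends it
  action: Failover                # set to Relocate for a planned failback
  drPolicyRef:
    name: odr-policy-5m           # assumed DRPolicy name
  placementRef:
    kind: PlacementRule
    name: busybox-placement       # the ACM placement rule for the app
  pvcSelector:
    matchLabels:
      appname: busybox            # selects which PVCs are DR-protected
```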
And the only data not in that snapshot was 39 seconds' worth. So if we were to look at the data in the volume, we would see all the data that was written while the application was on cluster one, then a 39-second gap, meaning no data, and then you would see the 4KB files being written again. Can I ask a quick question, just in case I missed it? Oh, yeah. Got to pause. Sorry. Thank you. So, correct me if I'm wrong, the control of where the application is currently running: did you do that in cluster one? You went into an operator and changed something there? Or is that at the ACM level? So when you deploy the application, there's something called a placement rule. This is an ACM thing. Okay. The placement rule decides what is essentially the primary cluster, or the preferred cluster, for that application. In the same view where we were looking at DRPlacementControl, there's the placement rule, which is an ACM API that says: where do you want me to put this application? Then what DRPlacementControl does is say: okay, based on the placement rule, this is the primary location for the application, but the alternate location, the failover cluster, is this location. So if the primary cluster completely disappeared, with no opportunity to go in and change anything, ACM, located elsewhere, would pick up and say, oh, my preferred cluster isn't available, let me move? Well, currently that part is not automated. Let's say you had a lights-out on cluster one, just no communication. Just like I showed in the flow chart, the OpenShift admin would still need to go to the hub cluster and manually make that change. Okay. Yeah. Thank you. Should I play? Yeah, we can definitely continue now. Okay, play. So we're now on cluster two.
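The ACM placement rule mentioned in this exchange, the object that decides which managed cluster hosts the application, can be sketched as follows; the names are illustrative, and a real DR-managed placement rule may carry additional settings specific to your setup.

```yaml
# ACM PlacementRule in the application's namespace on the hub cluster;
# it selects the managed cluster where the app is deployed
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  name: busybox-placement
  namespace: busybox-sample
spec:
  clusterReplicas: 1              # run the app on exactly one cluster
  clusterConditions:
    - type: ManagedClusterConditionAvailable
      status: "True"              # only consider healthy clusters
```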
And as I said, the amount of data lost was very small relative to what would essentially be your RPO. If our RPO was five minutes, we did well. What we want to do now is go back, or fail back, to the original primary cluster. To do that, our action is Relocate. We save that, and again, similar to failover, as soon as you save it, the actions start to happen. You can see on the bottom, in cluster two, the resources are terminating, and we'll put a watch on the top cluster to watch the resources and the application be redeployed by ACM, using the replicated volume. One difference here when we do a relocate: we're not going to have any data outage, because relocate causes the volume to be synchronized at the time the administrator does the relocate. So we won't lose any data with a relocate, but we will still have an application outage, as you can see here. Once the application is terminated and reinstalled, it takes time. In this case, we had about a 40-second application outage, but no data loss. So I think that's about it, Michelle, for the demo. You can go ahead and stop it and go back. Wow. So at the moment, there are no more questions, and I'm looking at the chat. I'm seeing: that was wonderful. I thought that was really fantastic, and I think I remember at one point even doing some of this manually based on some of your instructions. So this is a big improvement. Yeah. And I just want to make it clear, for anybody who is watching this now or watches it in the future, that as of January 2022, all of this is available in downstream products. What you'd be looking at is OpenShift 4.9, ACM 2.4, and ODF 4.9. If you get that combination together, we have a very detailed guide in our official documentation on how to set this up, and feel free to reach out to me. I don't know if we can put my name in the... I'll put it in some of the blogs so people can find the video. Okay.
Yeah. But we're excited about this, and we really want some people to try it out and see how it works for them. It's a really great demonstration. So thank you so much, and thank everyone for attending. Thank you. All right. Thanks, Michelle.