Hello, welcome to KubeCon 2021. Today's topic is disaster recovery: the ability to recover from a full data center loss, making sure you have the data and can recover your application. My name is Orit Wasserman. I'm the OpenShift Data Foundation Chief Architect at Red Hat. With me today is my colleague Shyamsundar Ranganathan, also from Red Hat, who is an architect on the storage team. Today I will go over the fundamentals of disaster recovery and then talk about storage replication in Kubernetes. Then Shyam will go over multi-cluster management and the failover and relocation orchestration. Then we'll have a short demo and talk about future work. The demo is based on Ceph, a software-defined storage system that is completely open source and quite mature. The idea behind Ceph is installing the software on standard servers, using standard disks, from HDD to NVMe, with system memory, and using a standard IP network, to get a very reliable, highly available, very scalable distributed storage system. In a single Ceph cluster, you get all the storage you need: object storage with the RADOS Gateway, block storage from RBD, and CephFS to provide a distributed file system. To deploy Ceph in Kubernetes, we are going to use Rook, a CNCF-graduated project. With Rook you can manage Ceph, deploy it, configure it, and activate advanced features like disaster recovery and replication, using basic Kubernetes concepts, with a set of operators and custom resource definitions. Let's talk about disaster recovery. Disaster recovery is making sure we can serve our customers even if we lose a data center or a cloud region. And sadly, this can happen. On the left, for example, we have a photo of OVH, one of the largest data center operators in Europe, whose data center was completely destroyed by fire last year. In most cases, our disaster recovery site should be far away, and that will mean high network latency.
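As an aside, deploying Ceph with Rook as described above is driven by a CephCluster custom resource. A minimal sketch (the image tag and device settings are illustrative; check the Rook documentation for your version):

```yaml
# Minimal Rook CephCluster: 3 monitors, consuming all empty
# disks on all nodes in the cluster.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16   # illustrative Ceph release
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                       # odd count for quorum
  storage:
    useAllNodes: true
    useAllDevices: true
```

The Rook operator watches this resource and converges the cluster toward it, which is what makes features like mirroring configurable declaratively later on.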
In that case, we won't be able to use synchronous replication, but asynchronous, which means we don't wait for the write to sync to the remote site. And that means there will be some data loss in case of recovery. If we can maintain low enough network latency, we can use synchronous replication, and then we can even get high availability. In this talk, we're going to focus on regional DR, which means two separate remote sites with high network latency, two separate Kubernetes clusters, two separate storage systems. The case where we can stretch the storage system across the sites is sometimes referred to as Metro DR. Two very important ways we measure disaster recovery are RPO and RTO. Recovery point objective (RPO) is the amount of acceptable data loss in case of recovery from a disaster; it's measured in time units. Recovery time objective (RTO) is the amount of time the application can be down before it impacts the business; again, measured in time, typically minutes. So what is the ideal disaster recovery? We want the RPO, the amount of data lost, to be as low as possible: minutes, no more. We want to reduce the RTO, the time it takes us to recover from the moment we detected the disaster. And it will be much easier if we have a single pane of glass to drive the failover and failback; a much better user experience. Our example is Ceph, with Rook, for regional DR. We have two clusters, West and East, each with its own Rook-Ceph deployment. We are using Ceph RBD geo-replication. In this example, RBD volumes are mirrored asynchronously, based on snapshots. The PVs are active and writable in cluster West, and they are on standby, replicated, in cluster East. We can also have two-way replication, meaning East can be a standby for West and West can be a standby for East, a much more efficient use of resources. In that setup, we can support an RPO of minutes and an RTO of a few minutes. So what happens in our Kubernetes clusters when we have a region failure? In that case, we will recover to the East region.
This is done without any cooperation from the storage and Kubernetes cluster in West. We forcefully promote the PVs, making them primary and writable in cluster East and unavailable in cluster West. We may have some data loss. When we want to fail back, or relocate, to cluster West, we don't want to have data loss, and in this case both regions actually participate in the relocation operation. First we let cluster West catch up. Then the application and the PVs are brought down in cluster East, making sure all the data is synchronized to the storage system. Cluster West pulls only the latest changes, and then we make the PVs primary and writable again and start the application in cluster West, without any data loss. So let's talk about storage replication. But first, why can't we use backups? Backups are great and very important; you should always back up your data. But for disaster recovery they fall short. Think about how frequently we back up our data: daily, or in extreme cases every few hours. This means the RPO, the amount of data we can lose in case of disaster, is hours; a lot of data. Then think about recovery. In case of disaster, we need to run a restore-from-backup operation. This is usually also a very long operation, so a very high RTO. Then think about the amount of data we need to transfer. Say we back up to a remote object store: in case of recovery, we need to read the full backup in order to restore the data. All the data; that's a lot. If we use incremental backups, we transfer less data, but that increases our RTO, the restoration time. Also, incremental backup requires us to be able to detect incremental changes between the snapshots we took. In Kubernetes, there's an enhancement to support changed block tracking (CBT) for snapshots, but it's not GA yet. Now let's think about Kubernetes resources.
Since backups happen at a point in time, we may lose some resource state or changes made in between. But do we have to? Kubernetes resources don't change that frequently, so we can capture them continuously. Or we can keep them in a declarative source, like a Git repository; in that case we can even get to RPO zero for the resources, meaning we won't lose any of them in a disaster; we have exactly the resources we had in the primary cluster. Luckily for us, storage systems have geo-replication built in. Usually it's based on periodic snapshots, and normally only the incremental delta between snapshots, the changes within that period, is transferred. That allows a lower RPO. Storage-level replication is also much more efficient, and it separates the replication traffic from the user I/O. And recovery is almost instant, or very close, because we already have the latest data at the recovery site, so the RTO from the storage perspective is very, very low. The problem is that each storage system has its own API, its own configuration, its own flow, and quite a lot of complexity in using its geo-replication. What we desire is a more standard API: an API that allows us to set up the replication relationship between the two clusters, allows us to manage the replication state, and can still support vendor-specific management. What we did in Ceph CSI is extend the standard CSI API and add a new resource, VolumeReplication. When we create a VolumeReplication, we basically enable replication for that volume between the two clusters. When we delete the resource, we disable the replication. It carries the replication state, meaning whether the volume is primary or not. In this case, the volume is primary in East, our primary site: it is actively available in East and replicated to West. The data source points to the PVC that is being replicated. We have a new CSI sidecar that handles the reconciliation of those resources. Here is an example.
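A sketch of what such a VolumeReplication and the VolumeReplicationClass it references might look like (the API group and field names follow the csi-addons volume-replication API around the time of this talk; the names, namespace, and parameter keys here are illustrative and should be checked against your installed CRDs):

```yaml
# Enables replication for one PVC; replicationState flips
# between primary and secondary during failover/relocate.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: busybox-pvc-replication
  namespace: busybox-sample
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: primary          # this side is writable
  dataSource:
    kind: PersistentVolumeClaim
    name: busybox-pvc                # the PVC being replicated
---
# Class-level settings: schedule, secrets, vendor parameters.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot          # RBD snapshot-based mirroring
    schedulingInterval: "1m"         # drives the achievable RPO
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: rook-ceph
```

Deleting the VolumeReplication disables replication for that volume, matching the create/delete semantics described above.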
To enable more capabilities, we also added a VolumeReplicationClass. The class contains the secrets the storage systems need in order to communicate with each other. It lets us define a replication schedule, since not all applications need the same schedule; some don't change their data much and can use longer periods. And of course it stores the vendor-specific parameters needed for the storage replication. So let's look at recovery again. We already have a VolumeReplication created in cluster East that defines the PV as primary in cluster East. Now, when cluster East is down and we want to recover to cluster West, we create a new VolumeReplication resource in cluster West, with the replication state set to primary, using the same data source, the same PVC, as we did in cluster East. The sidecar will detect the change and use a gRPC command to communicate with the CSI driver, which will modify the storage and force-promote the PV in cluster West, making it writable and primary. Relocate, on the other hand, requires changes on both clusters. We edit the VolumeReplications so the volume becomes secondary in cluster West and primary in cluster East. The sidecars on both sides do the reconciliation, and then the drivers on both sides apply the needed changes. Are we done? Well, this works great if the PVC was statically provisioned. What about dynamic provisioning? Here we have a problem. If the PVC was provisioned dynamically, then when we want to recover, we need to attach the PVC not to a newly provisioned volume but to an existing PV, the replicated one. So we actually create a matching PV in cluster West that is connected to the volume we replicated to cluster West. Then, when we recover, we won't use dynamic provisioning; we need a way to tell Kubernetes we want to use the existing PV, to bind the PVC to that PV. So now we turn dynamic provisioning into static provisioning.
We bind the PVC to the PV that is connected to the replicated volume. Relocate likewise requires changing the VolumeReplication on both sides. We use the volume handle and the underlying storage system to identify those PVs. This works well for us since it's Ceph CSI on both sides, but it may be a problem for other CSI drivers or storage systems, and we need to standardize it. Now let's go to Shyam, who will give you more detail about multi-cluster management. Thanks, Orit. Let's talk about multi-cluster management. We've seen that disaster recovery requires multiple clusters, peer clusters, so that workloads can be recovered or relocated across them, which basically means we need clusters that are configured equivalently, so that the applications can run on any of these clusters. We also need any custom resources that these applications use, their operators and custom resource definitions, deployed on both clusters, or all clusters. And we've also looked at storage: storage has to be set up across these clusters in an equivalent fashion, such that the volume replication class, the storage class, everything is equal, and the application can be redeployed onto a target cluster. So from a cross-cluster perspective, cluster configuration is something that we want to maintain across these clusters. From a user perspective, for application recovery and relocation, users would need access to congruent namespaces in these clusters where they can place their workloads. They would also need a declarative copy of their resources, the application manifests, that they can recreate on these clusters. And finally, they would need to reroute their global traffic manager, the inbound traffic, to the cluster that's actually running their workload at any given instant. Beyond this, across these clusters, we also need a level of health monitoring, for alerting when a cluster is available or unavailable for various reasons.
And finally, to do the recovery and relocation orchestration, which is a cross-cluster action, we need to manage multiple Kubernetes clusters. We chose the Open Cluster Management (OCM) project, which, as the tagline goes, is a community-driven project focused on multi-cluster and multi-cloud scenarios for Kubernetes apps. That works perfectly for us: we are looking at multiple clusters and managing apps across them. OCM provides a cluster registry, distribution of work across the clusters in the registry, and placement of content through a vendor-neutral API, which again is useful for wider adoption. We leverage Open Cluster Management primarily for cluster configuration and, more importantly, to manage the application lifecycle. We need application manifests coming from a declarative source, and OCM has a Channel CRD, which basically points to a Git repository or an object store holding the declarative copy of the application manifests. OCM also has a PlacementRule CRD, which decides where, on which clusters, an application should be placed. We leverage that so that we can do the orchestration during disaster recovery, by scheduling the placement rule appropriately. There are gaps: disaster recovery orchestration is not a part of OCM, which is oriented more toward stateless apps. That is where we come in, and we'll talk about what we're going to do about it. So, for disaster recovery orchestration: although we have these various components, what does it mean for a user, and how easy or complex is it? As it stands, it actually becomes quite complex, as we can see by looking at three use cases: deploy, relocate, and recover. Let's start with deploy. For example, the user would have to deploy their application resources to a cluster, let's say East in this particular example. They would have to create VolumeReplication resources as primary, to establish volume replication for every PVC in this particular namespace.
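The OCM Channel and PlacementRule CRDs mentioned above can be sketched roughly like this (resource names, the namespace, and the Git URL are hypothetical; the `schedulerName` override is how Ramen, introduced later, takes over placement decisions, so treat it as an assumption to verify against the Ramen docs):

```yaml
# Channel: the declarative source of the application manifests.
apiVersion: apps.open-cluster-management.io/v1
kind: Channel
metadata:
  name: busybox-channel
  namespace: busybox-sample
spec:
  type: GitHub
  pathname: https://github.com/example/busybox-app   # hypothetical repo
---
# PlacementRule: decides which managed cluster runs the app.
apiVersion: apps.open-cluster-management.io/v1
kind: PlacementRule
metadata:
  name: busybox-placement
  namespace: busybox-sample
spec:
  clusterReplicas: 1      # run on exactly one cluster at a time
  schedulerName: ramen    # delegate scheduling to the DR orchestrator
```

An OCM Subscription (not shown) would tie the Channel to the PlacementRule so the manifests get deployed wherever the rule decides.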
They would have to ensure that volume replication is occurring before they back up the PV cluster data, because that's what's going to be used in the alternate cluster, as Orit mentioned, to reattach to the volume in the storage backend. And finally, if there's a new PVC created by the application in the namespace, they'll have to repeat this protection for that particular PVC. Not too bad, but let's go to recover and see how the complexity starts increasing. In the event that the East cluster goes down, the user would have to recover the workload on West, and hence would have to restore the PV cluster data first, so that the PVCs that are part of the application manifests, as they are restored, reattach to the respective PVs. Then they would have to create the VolumeReplication resources against these PVs so that they can be marked primary and are ready for use on the West cluster. Technically, at the point that we've covered these steps, the RTO goal is met: we have recovered the application in some amount of time. But it doesn't end there. When the East cluster recovers, in this particular case let's say East was only temporarily offline, we have to ensure that the application resources on East are deleted, and that the VolumeReplication resource is marked secondary, so that East starts re-syncing data from the new primary on West. Once the re-sync is established, we can delete the VolumeReplication resource, which would in turn delete the PVC resource and free up the East cluster. And of course, once this happens, as usual, we're going to have to protect any new PVCs that appear on West in the application's namespace. Relocate increases the complexity, because first we need to remove the application from the West cluster, for which we need to make sure we delete the application resources on West and ensure the VolumeReplication is marked secondary.
We need to wait for the final sync of data to be reported by the VolumeReplication resource, because relocate happens when both clusters are active and the recovery point objective is zero; in other words, we need all the data. Once the final re-sync is complete, we can get rid of the VolumeReplication resource, et cetera, on the West cluster, and then we can start bringing up the application on the East cluster: restoring the PVs first, recreating the application, recreating VolumeReplication as primary. So there are quite a few steps to perform these actions. And so what we did next was to create a DR orchestrator called Ramen. What Ramen does, as per its tagline, is instant cloud-native workload recovery and relocation across Kubernetes clusters. Yes, we are interested in recovery and relocation across Kubernetes clusters; that's the use case we talked about, and that's what Ramen helps us with. Ramen basically enhances the OCM placement rule scheduler by adding its own scheduler, so that it can orchestrate workload placement in the OCM control plane. It further provides label selectors for PVCs that need protection, auto-creates volume replication relationships for those PVCs as primary or secondary, and manages the replication state, so that users do not have to worry about dynamically created PVCs, for example when StatefulSets are in use. Ramen provides two APIs. The first is the DRPolicy API, a cluster-scoped resource, which declares which pair of clusters are in a DR relationship, so that we know the cluster configuration is equivalent and storage is set up as needed across these two peer clusters. It defines a replication schedule, which basically defines the recovery point objective for any application that uses this particular policy. And it optionally has a volume replication class selector, to disambiguate in case there are multiple volume replication classes with the same schedule on the clusters.
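A sketch of the DRPolicy just described (field names roughly follow Ramen's v1alpha1 API from around this period; the cluster names, S3 profile names, and label are illustrative, and the `s3ProfileName` fields assume object-store profiles configured for the PV metadata backup mentioned earlier):

```yaml
# Cluster-scoped policy pairing two peer clusters for DR.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: dr-policy
spec:
  drClusterSet:
    - name: east
      s3ProfileName: s3-east     # where PV cluster data is backed up
    - name: west
      s3ProfileName: s3-west
  schedulingInterval: 1m         # replication schedule, i.e. the RPO
  replicationClassSelector:      # optional disambiguation
    matchLabels:
      class: rbd-1m
```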
The DRPolicy object is a cluster-scoped object set up by the administrator. The next API, the DRPlacementControl API, is per application and namespace-scoped, and it basically controls the placement and orchestration of an application across these clusters. It reconciles the placement rule referenced by spec.placementRef in the DRPlacementControl. It refers to the DRPolicy so that it knows what the scheduling interval is and which clusters this workload can be placed on. It auto-protects PVCs based on the PVC selector provided in its spec. And it provides two actions: the failover action, which moves the workload to the failover cluster, and the relocate action, which relocates the workload to the preferred cluster. Initially, a preferred cluster can be specified if a particular region is desired for an application, or it can be left empty, with the workload dynamically scheduled on either of the clusters referred to in the DRPolicy reference. So with just this DRPlacementControl API, we can relocate or recover an application, and we'll soon see a demo of that. But before we go there, how is Ramen deployed on these various clusters? Ramen has two operators. One of the operators runs on the OCM hub cluster, the multi-cluster orchestration plane, where the DRPolicy and DRPlacementControl objects are created. Its responsibility is obviously to reconcile the DRPlacementControl object. On the managed clusters, where the workloads will be running, there is an additional VolumeReplicationGroup API resource that is present and managed by the Ramen cluster operator. We're not going into it in detail, because it's auto-managed by Ramen, but the VolumeReplicationGroup API is the one that helps detect new PVCs as they are created, and when PVCs are no longer in use, so that a final sync is initiated with the volume replication.
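Pulling the DRPlacementControl fields described above together, a sketch (again following Ramen's v1alpha1 API; the names, namespace, and label are illustrative):

```yaml
# Namespaced, per-application DR orchestration handle.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: busybox-drpc
  namespace: busybox-sample
spec:
  drPolicyRef:
    name: dr-policy            # which peer-cluster pair / schedule
  placementRef:
    kind: PlacementRule
    name: busybox-placement    # the OCM placement rule Ramen reconciles
  pvcSelector:
    matchLabels:
      appname: busybox         # PVCs to auto-protect
  preferredCluster: east       # optional; empty means dynamic choice
```

Recovery and relocation are then driven by setting `spec.action` (with `spec.failoverCluster` for a failover) on this one object.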
All in-cluster activities are managed by the VolumeReplicationGroup API, which is controlled by the DRPlacementControl API at the hub. So with that, let's go into a quick demo. What we have here is a demonstration of how we control relocation based on DRPlacementControl. We have three clusters: East, West, and the hub cluster, the hub cluster being the Ramen hub, with East and West being the two managed clusters where applications will be deployed. We've already deployed the application on the East cluster. The application is a busybox pod in the busybox-sample namespace. It very simply echoes the current timestamp to a file every 10 seconds, and the file is on a mount point backed by a PVC, which is backed by Ceph RBD, which is set up across East and West for storage replication. The DRPlacementControl on the hub is responsible for having deployed this particular resource, so let's take a look at it. It references a DRPolicy, and the policy has the East and West clusters in it. It is going to reconcile the busybox placement rule. And finally, it has a preferred cluster specified, which tells it which cluster to prefer; it was East, and that's why the application was deployed on East. It has already set up the volume replication and other such requirements for the PVC in use, and the user doesn't have to worry about those things. Now, as we are writing timestamps to the file and we want to demonstrate recovery, let's tail the busybox pod so that we know the latest timestamps, then come back to the hub to perform a recovery to the West cluster, and we'll see what kind of data loss we encounter. To perform a recovery of the workload on the West cluster, we need to edit the DRPlacementControl and provide it with an action of Failover.
And we need to give it a failover cluster, which in this particular case is West, so that the workload can move from East to West without any coordination; failover assumes East is down, although East is not really down in this example. Now that we've committed that, Ramen has started moving the workload. We watch for the busybox pod on West, and once it comes into the running state, we'll take a look at the data. Speeding up, we see the last series of timestamps that were written on the East cluster, which is what we'll compare with the West cluster. On the West cluster, we see that the pod is already running. So let's take a look at the data within the pod and see what data we lost, because this was a recovery, a failover operation, that we performed. Looking at our file here, we see that on East we actually wrote a 37:55 timestamp, then a series of 38s, and a 39 timestamp. On the West cluster, we look at what the timestamps are and how much data we lost. If you look at it here, we got the 37:55 timestamp, but we lost the series of 38s and the 39, which was about a minute's worth of data. The replication schedule we set up for this demo is one minute, so we lost about a minute of data from East. The application started running again at 39:13, so the recovery time was approximately a minute and 20 seconds, give or take. And how did we achieve this? We basically edited the DRPlacementControl with an action of Failover, gave it a failover cluster, and let the entire orchestration happen for us. In the next part of the demo, we're going to do a relocate, where we're going to show zero data loss. So let's start tailing the busybox pod on the West cluster. In this particular case, we expect zero data loss: we need all the timestamps back on East once we relocate. And to relocate, we go ahead and edit the DRPlacementControl object again.
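For reference, the failover step just described amounts to patching two fields on the DRPlacementControl spec (values illustrative, matching the demo's cluster names):

```yaml
# Edit applied to busybox-drpc for the failover demo step.
spec:
  action: Failover
  failoverCluster: west   # move the workload here, East assumed down
```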
This time around, we change the action: instead of Failover, we say Relocate. We already have a preferred cluster of East in there, so it's just going to relocate to the East cluster. The failover cluster is still in there, but that's immaterial, because it's not used for this purpose. So, moving on, we watch for the pod on the East cluster, which is where it's getting relocated to, and we've tailed the logs on the West cluster to make sure all the timestamps made it. After some time, we find that the pod is running on the East cluster. What we're really looking for is that the 43:09 timestamp is still present on the East cluster, to ensure an RPO of zero. So let's get the content of the volume on the East cluster and take a look at the timestamps. We were looking for 43:09, and on the East cluster the 43:09 timestamp is present. And if you look at it, the next timestamp is 44:39, so it took about a minute and 30 seconds, which is your RTO; although this is not purely a recovery, there is application downtime during relocation. And how did we do this? Again, we just edited the DRPlacementControl, changed the action to Relocate; it had a preferred cluster and moved over there. So DRPlacementControl is the tip of the iceberg: it hides the complexity of the remaining parts of the stack that we are using, volume replication, storage, application management with OCM, and provides an easy-to-use interface to orchestrate relocation and recovery. Let's look at future work in this area. So what do we want to do next? We talked about regional DR. There is also the Metro DR case, where the assumption is that storage replication is synchronous, so there is no data loss; RPO is zero. But we would still need to recover and relocate applications across these clusters. And we might want to tackle storage fencing, because when we know a particular cluster is unavailable, it could still be accessing the storage.
And we do not want multiple writers to the same storage endpoint corrupting the data, so that would be an interesting use case to tackle with the same or a similar stack. The upstream data protection working group (SIG) has provided the AnyVolumeDataSource feature gate, which allows a PVC to be created from an arbitrary kind in the Kubernetes cluster. We want to leverage that so that we can use VolumeReplication as a kind to create a PVC from, which can help us move away from copying PVs across clusters and static provisioning, and away from assuming that the PV's CSI volume handle can be reused in the same fashion across storage vendors and clusters. From a replication consistency perspective, the storage replication is based on storage snapshots and hence is crash-consistent; it's not necessarily application-consistent. There is work going on in the data protection SIG on application-consistent snapshots; we need to integrate with or leverage that to provide application-consistent replication snapshots. To further improve consistency: applications do not always use a single PVC; they may have a group of PVCs. Again, there is work going on around volume groups and volume group snapshots in the data protection SIG, and when that becomes a reality, the replication of such a set of PVCs will also be point-in-time consistent across the group, providing better resilience and recovery for the application on the target clusters. We'd also want to move toward more storage-agnostic replication mechanisms. For example, proposals like changed block tracking can help reduce RPO while transferring data across different storage systems; the two sides don't have to be the same storage system, which makes us storage-agnostic, addressing RPO using changed blocks across snapshots instead of using backups.
We would also look at providing a more generic replication scheme than volume replication, so that data can be transferred across peer clusters whose storage systems don't have to be the same; that provides for a hybrid approach in the future. And with that, here are some links and references for the various components that we talked about in this presentation and used to build Ramen and volume replication. Please do take a look, and participate as the case may be. And thank you, that ends our presentation. We're open for any questions that you may have.