So everyone, welcome back to another OpenShift Commons. You can see I've changed my background, and that's special for today because we're talking storage. Many of you may not know, but those here from Red Hat may have seen a lot of our storage discussions and spent a lot of time with me here in my dining room, so welcome back. We're also changing it up today: today we have an upstream discussion, and I'm always excited about those. It's so important that we work upstream with the CNCF, and today we have members from TAG Storage talking about a topic that is important to pretty much all of us, especially enterprise installations and anybody concerned with how to get your data or your applications back if something happens. So today we have Alex Chircop from StorageOS. He is the CEO, but he is also co-chair of the CNCF TAG Storage, and TAG now stands for Technical Advisory Group; it has been changed from SIG. We also have Raffaele Spazzoli, who is a tech lead for TAG Storage. Very excited to have you both, thank you so much for joining us. And Raffaele, Alex, if you'd like to tell us a little more about what you do, especially with TAG Storage, and then dive right in. Absolutely, I'll go first. Thank you so much, Karena, and so glad to be here with the OpenShift community. The CNCF a couple of years back created SIGs, and SIG Storage was in fact one of the first SIGs we created. The purpose was to help the CNCF with evaluating projects and to create educational content and material for the end users and the community. Since then the SIG has been renamed to TAG, just because SIGs were getting confused with the Kubernetes SIGs, and as Karena mentioned, TAG stands for Technical Advisory Group. I co-chair the group together with Xing Yang and Quinton Hoole, and we also have a number of tech leads, of which Raffaele is one of our newest members. So I'd like to pass the baton on to Raffaele to introduce himself. Thank you, Alex. Yes, I work at Red Hat as an architect in consulting, so I help customers build their cloud-native solutions, and I recently joined TAG Storage to work on the topic you see on the screen today, which is cloud-native disaster recovery. This is a collaboration that started about nine months ago, I think in the fall of last year, when we started discussing whether this was a good idea to explore in this particular TAG or whether it should belong somewhere else. Then we started sharing ideas, and we are now creating a white paper that I think we're going to publish soon. Today we're going to present some of the results of this white paper. Indeed. This is particularly exciting for the TAG because we've had a lot of demand and in fact a lot of feedback in the different TAG calls, which Raffaele has been very patient with, and it has meant that we've been able to iterate on the document. And just for everybody's benefit, when we talk about storage and persistence in the CNCF, we're not just talking about a volume or a file system; we cover any type of persistence layer, including databases, object stores, and key-value stores, for example, as well as traditional volumes and file systems.
And I think the key thing here, and keep me honest here, Raffaele, is that what we're trying to do is make sure that developers, DevOps teams, SREs, et cetera, have the information, the tools, and examples on how to adopt some of these different technologies in a cloud-native way, because for the first time in a long time it's no longer infrastructure teams that are making these decisions but developers and their deployments. Understanding the storage subsystems and the different technologies available with cloud-native technologies is extremely exciting and enables so many new use cases, which Raffaele will cover shortly. Yeah, that's exactly right, I totally agree. What we would like to do today is present an approach to disaster recovery that we call cloud native. Obviously it focuses on stateful workloads, all the kinds of storage workloads Alex was mentioning. It turns out that the hard problem you have to solve when you distribute a stateful workload is always the same regardless of the kind of interface you expose. It doesn't really matter whether you expose a block storage interface, a message queue, or a SQL database: the internal state sync is really the hard problem you need to solve, and that's what we explore in this white paper. From a user perspective, the concept we would like to make people aware of is this new concept of cloud-native disaster recovery. I just said new concept, but the approach is something you could have done even in the past. Our point is really that with cloud-native approaches it becomes less complex, easier, and probably less expensive to create these architectures and deployments. The way we define cloud-native disaster recovery is by contrasting it with traditional disaster recovery. We had an internal discussion in the TAG about whether calling this traditional disaster recovery was correct or not. By traditional disaster recovery we mean what you would normally find in many enterprise customers: not the big web scalers, not the newer startups, but the enterprise customers that many of us work with. So let's go down these columns and rows one by one. I'm going to try to keep it brief because we don't have a lot of time, but let's create this contrast and talk about what cloud-native DR is. Type of deployment: are we deploying active-active or active-passive? In most traditional DR scenarios, what we see is an active-passive deployment, especially for the stateful workloads. You may have stateless workloads that are active-active, but they all point to a single active site, and then there is a passive site for the stateful part; that's often what you see. For cloud-native DR, we are proposing that it should be active-active. Then, obviously, we're talking disaster: there is a disaster situation, so how do we detect it? With traditional DR, in most cases it's a human decision; somebody says, okay, this is really a disaster, we need to trigger the disaster recovery procedure. In cloud-native DR, we want it to be autonomous: we want the system to realize something is going wrong and react to it. And for the recovery procedure itself, what we normally see is a mix of manual and automated tasks; maybe the tasks are mostly automated, but somebody is triggering them in a manual way.
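To make the idea of autonomous detection concrete, here is a minimal sketch in Python. It assumes a hypothetical global load balancer API (`set_active_backends`) and made-up per-site health URLs; real deployments would rely on the health-check features built into the load balancer or DNS layer, so this is only an illustration of removing the human decision from the loop, not anyone's actual implementation.

```python
# Minimal sketch of autonomous disaster detection (hypothetical endpoints and API).
import time
import urllib.request

SITES = {
    "dc-east": "https://dc-east.example.com/healthz",      # hypothetical health endpoints
    "dc-central": "https://dc-central.example.com/healthz",
    "dc-west": "https://dc-west.example.com/healthz",
}
FAILURE_THRESHOLD = 3            # consecutive failed probes before a site is evicted
failures = {site: 0 for site in SITES}

def probe(url: str) -> bool:
    """Return True if the site answers its health check within 2 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def set_active_backends(sites: list) -> None:
    """Placeholder for a global load balancer API call (assumption, not a real API)."""
    print(f"routing traffic only to: {sites}")

while True:
    for site, url in SITES.items():
        failures[site] = 0 if probe(url) else failures[site] + 1
    healthy = [s for s, n in failures.items() if n < FAILURE_THRESHOLD]
    set_active_backends(healthy)  # no human in the loop: eviction is automatic
    time.sleep(10)
```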
That mix of manual triggering and automated tasks is at least what I see the most in my experience. In cloud-native DR, we want full automation of the recovery: whatever the stateful workload needs to do to reset or reorganize itself, its replicas, and its partitions so it can continue servicing requests. Then there are the two metrics of DR. RTO, the recovery time objective, is how long the system is down, how soon we can get the service up and running again. RPO, the recovery point objective, is a measure of how much data is lost because of the disaster; it can also be a measure of how much inconsistency I have created if I have multiple copies of the data. For RTO, in traditional DR, good enterprises get close to zero, but normally it takes hours to recover from a disaster. In cloud-native DR, we want this to be close to zero: essentially, a couple of health checks have to fail, and then we realize the situation is a disaster and the system starts steering traffic to the healthy locations. For RPO, again, traditional deployments range from exactly zero to hours of loss; what I usually see is minutes to hours of data loss. In cloud-native DR you have two options. You can do strongly consistent deployments, which have exactly zero data loss. Or you can do eventually consistent deployments, which may have close to zero data loss in normal circumstances, but you are not guaranteed that the data, once it's reconciled, is exactly correct: it will be consistent, but it may not be correct for your application. That's a caveat of eventually consistent systems. Then, from an organizational perspective, what we notice is that in traditional enterprises, application teams are normally responsible for the business continuity plan of their service. But what they do is turn to the storage team, ask them what RTO and RPO they can guarantee for the storage in use, and then adopt that as their standard. So in fact the storage team is the driver of the enterprise DR strategy. We are arguing here that with cloud-native DR it's going to be the application team that has to choose the right piece of middleware, the right product to handle their state, and then they will organize the DR procedure around it. One other observation that came up while working with this new technology is that for traditional enterprise DR, the capabilities you need to do disaster recovery and to instantiate these procedures normally come from storage, in the shape of backup and restore or volume replication, whether synchronous or asynchronous. For cloud-native DR, we notice that these capabilities come more from networking. We see the need for east-west communication between your failure domains, which could be regions or data centers, so that the instances of this new generation of middleware can find each other and cluster up. And we see the need for global load balancers in front of the regions or data centers, because we need a way to direct traffic to the healthy locations; it needs to be a smart, intelligent global load balancer. Hopefully that gives you an idea of the differences between what we call traditional and cloud-native DR. Go ahead, Alex. I was just going to say that one of the key things here, I think,
and again, keep me honest here, Raffaele, is that we are moving into a world where applications are effectively composable and the infrastructure is declarative. What we're saying is that cloud native gives you a lot of the tools you need to automate and manage disaster recovery just like any other healing process that takes place in a standard cloud-native environment. We acknowledge that this isn't necessarily straightforward; some of these technologies are fairly advanced. But the point is that these new cloud-native architectures, and Raffaele will talk about a reference architecture with some of the options available, actually enable organizations to have automated failover and automated disaster recovery processes with better RTO and RPO than you would get with the manual failovers and manual tasks of a traditional system. I think that is extremely exciting, because what we're effectively saying is that we've made applications composable, developers can declare what their applications need from an environment, and now we're taking it a step further and saying that this also applies to the disaster recovery process. Right, exactly. And it is exciting; I find it very exciting. Some of you may doubt that it's even possible, or may repeat the story I've been told many times: I can get RTO and RPO as close to zero as I want, but the cost increases exponentially. We don't think that's true anymore. It's really a matter of composing the architecture in the right way; the cost does not increase exponentially, and you can actually reach these numbers with a relatively inexpensive deployment. I use the word relatively, but it's certainly not something that grows exponentially. So, talking about how we can build these architectures: here we are showing a strongly consistent blueprint, where we have a stateful workload that is capable of handling this horizontal state sync properly, guaranteeing that the correct replicas are in the correct regions or data centers. We need three failure domains, in this case data centers; we need three of them because otherwise we couldn't reach quorum. And, as I said, in front of the stateful workload we normally have a front end, probably a stateless front end, and in front of that a global load balancer. This is a very generic blueprint that you can reuse in several situations. I failed to mention that the stateful workload obviously has storage that comes from the local data centers, but as you can see, we don't rely on storage capabilities to replicate across data centers: all of that interaction is handled by the stateful workload itself. So how can we build these stateful workloads? Because obviously we are now relying more on the middleware side. I have a couple of slides on this from a conceptual standpoint, and I'm going to try to go fast here so maybe we have time to explore one of these deployments. We need to define, and understand, the concept of a failure domain: a failure domain is an area of our system that can go down or fail due to a single event.
So nodes could be failure domains, and so could racks, clusters, network zones, availability zones, regions, and data centers; they are all failure domains, and they contain each other, so you can scale the failure domain out from the smallest one, a single server for example, up to an entire data center. The theory around distributed stateful workloads works the same regardless of the scale of the failure domain. In this conversation we are talking about disaster recovery, so we are talking about the failure domain being the data center, because that's what disaster usually means: there is an event, and because of that event I lose the entire data center. So our failure domain of reference is the data center. By disaster we mean losing an entire failure domain, in particular an entire data center if we don't specify otherwise, and disaster recovery is my strategy for what happens when that happens. High availability is a slightly different concept: high availability is about what happens when something breaks within a failure domain, maybe a single fault, and whether the service continues inside that failure domain or not. And then we have the concept of consistency, which is the property of a distributed workload that all the instances observe the same state, so the state is consistent across the instances. We need that because when we lose a failure domain, we will probably lose some of those instances, and we need the state to have been synced everywhere so we don't lose it, okay? Yeah, and I think it's fine if you go to the next slide. On that point, getting strong consistency is probably the single biggest architectural challenge. Trying to ensure strong consistency, which is one of those key attributes in any storage layer or database layer, is effectively a balancing act between minimizing latency and maximizing availability. And there's a very convenient theorem here, the CAP theorem, which says you have three properties, consistency, availability, and partition tolerance, and you basically need to pick two. I won't steal Raffaele's thunder; I'll let him explain the details and how it applies to the different systems we're talking about. Yeah, thank you, Alex. You stated the theorem correctly: usually you pick two of the three, consistency, availability, and partition tolerance; I should improve the picture here. I like to state the theorem slightly differently, because I think it helps you understand the kinds of choices that are made in software today: network partitioning is not something that you control, it's a fault that happens. So a way of reasoning about the CAP theorem is: assuming a network partition, what do you want your software to do? Do you want it to be consistent or do you want it to be available? You can only pick one of those two, because the network partition is not something you pick, it just happens. I'm showing in this little table some common software where the choice is very clear. If you choose consistency, it means you're building a strongly consistent system; if you choose availability, it means you're building an eventually consistent system. Both have pros and cons, right?
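As a minimal illustration of that choice, here is a Python sketch, assuming three replicas split by a network partition. A CP system only accepts writes on the side that still holds a majority; an AP system accepts writes on both sides and has to reconcile the divergence later. This is purely illustrative and not the logic of any particular product.

```python
# Sketch of the CAP trade-off under a network partition (illustrative only).
REPLICAS = {"east", "central", "west"}

def cp_write(reachable: set, key: str, value: str) -> bool:
    """Consistency over availability: refuse the write without a quorum."""
    has_quorum = len(reachable) > len(REPLICAS) // 2
    if not has_quorum:
        return False                     # minority side becomes unavailable for writes
    print(f"committed {key}={value} on {sorted(reachable)}")
    return True

def ap_write(local_store: dict, key: str, value: str) -> bool:
    """Availability over consistency: always accept locally, reconcile later."""
    local_store[key] = value             # may diverge from the other side of the partition
    return True

# A partition isolates "west": the majority side keeps serving consistently,
# the minority side either refuses writes (CP) or keeps accepting and diverges (AP).
print(cp_write({"east", "central"}, "order-42", "paid"))   # True
print(cp_write({"west"}, "order-42", "cancelled"))         # False: no quorum
west_cache = {}
print(ap_write(west_cache, "order-42", "cancelled"))       # True, but now diverged
```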
Both have their uses, but it's a very clear design choice that you have to make when you build software. In reality, some of these systems can even be tuned, and depending on how you tune them they can change their behavior from being available to being consistent, but they are all built around the CAP theorem. There is a corollary to keep in mind, the PACELC theorem, I hope I pronounce it correctly, which essentially says: in the absence of a network partition, so when the network partition is not there, you can only optimize for latency or for consistency. I was recently doing some experiments where it was very clear what the PACELC corollary means, in particular with Kafka. Kafka is one of those systems that is tunable: if you configure Kafka to be consistent and then spread the cluster across regions with high latency, you get very high latency on the response for each individual operation, whether you are consuming a message or producing one; reading is always easier. That's just the nature of how this software works. It doesn't mean that with Kafka you can't still get a significant amount of throughput, but each individual transaction will have a high latency, because you have told Kafka you want it to be consistent. Okay, so that's how this works, and it's incredibly convenient to have a theorem to think about these things, because you can take a piece of software that you don't understand and ask the vendor, or whoever is the expert, to talk about that software in light of the theorem and explain the choices it makes, and just from that you will understand a lot about how that piece of software behaves. Alex, were you going to say something? I was just going to say, for further information, the TAG has also created a more generic storage landscape white paper, and in it we define all of the different attributes, like availability, scalability, performance, and durability, as well as consistency. And it's interesting because different systems will have different use cases, and it's probably worth noting that no one system will handle all of these cases, because very strong consistency typically has scalability or performance implications, and vice versa, for example. Right, it's a trade-off that you have to decide for your product. So we said we have several instances of this software running and clustering up, creating one logical instance. How do they sync? We need consensus protocols. I'm going to go quicker on this slide, but there are two types of consensus protocols, which we call shared state and unshared state. Shared state protocols are for agreeing on a state that all of the instances need to reflect. For this kind of protocol you can have an approach based on leader election, where the leader is the one that accepts all the writes and the others are followers. The algorithms that have been validated by academia are Paxos and Raft, and in this space Raft is becoming more and more popular. Most of the software that we have mentioned so far, and there is another list later, is based either on Paxos or on Raft when it comes to synchronizing shared state.
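To give a feel for what leader-based, shared-state consensus looks like, here is a heavily simplified Python sketch in the spirit of Raft. It is not a real Raft implementation: it only shows a leader appending an entry, replicating it to followers, and committing once a majority of the cluster (leader included) has acknowledged it, which is why losing one of three failure domains does not stop writes.

```python
# Simplified sketch of majority-based commit (Raft-flavored, not real Raft).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)
    reachable: bool = True

    def append(self, entry: str) -> bool:
        if not self.reachable:
            return False                 # simulates a follower in a lost failure domain
        self.log.append(entry)
        return True

def replicate(leader: Node, followers: list, entry: str) -> bool:
    """Commit the entry only when a majority of all nodes has persisted it."""
    acks = 1 if leader.append(entry) else 0     # the leader counts itself
    acks += sum(f.append(entry) for f in followers)
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority

nodes = [Node("east"), Node("central"), Node("west")]
nodes[2].reachable = False                      # lose one failure domain
ok = replicate(nodes[0], nodes[1:], "x=1")
print("committed" if ok else "rejected")        # still commits with 2 of 3 acks
```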
Then sometimes you have to synchronize different states, but you still have to make sure that the instances agree on either all writing or all not persisting that state. For that we have the well-known two-phase commit and three-phase commit protocols. Another thing to know, as I was saying at the beginning, is that the hardest problem this stateful software needs to solve is really always the same: I need to sync with my peers, and then I need to persist that data. Somehow I need a consensus protocol, a list of operations that have happened, and then I have to store them to persist that information. There is an interesting chapter in the SRE book from Google where they explain how you could theoretically build a piece of software that can be reused across all of these stateful workload products, because at the core it's always the same. Naturally, if you did it that way you wouldn't get optimized performance; it's a theoretical approach, and you would still have to make your own optimizations on top of it. But in reality there are several companies, for example, that use RocksDB, or there is another one from Apache, as a storage layer with some level of consensus protocol to coordinate the instances. So if you put everything together, you need these reliably replicating state machines, and then you can create partitions, where you separate your data so that you can scale horizontally, and replicas, where each partition is replicated to other instances so you don't lose data when something goes down. If you look at this picture on the right, this is the anatomy of a stateful application, at least a modern one, where you have two replicas of each of several partitions, so we end up with six instances. Each instance has its own storage, and then we have the coordination layer. Between the instances holding replicas of the same partition we use a shared state consensus protocol, because they are all storing the same state. And then, if this particular software supports transactions, whereby I do two operations on two different partitions but I want them to be one logical transaction, I need an inter-partition coordination protocol, and for that I can use something like two-phase commit. This gives you an idea of the anatomy of these stateful workloads that we can use to build what we call cloud-native disaster recovery deployments. I think just on that point, Raffaele, one of the exciting things here is that what is effectively happening is that we're layering different proven technologies. You might have sharding for performance and Raft protocols for consistency, but you might also have a variety of different layers in that stack, where for example you might have a SQL layer that uses a key-value store, that uses a sharding process, that uses a file system, et cetera. So more than ever before it's important to understand the different layers, because at the end of the day the attributes of your system, your failover capabilities, and your DR capabilities are going to be an amalgamation of all of those different attributes. Yeah, I couldn't agree more, especially on the observation you made about the interface layer, what is called the API layer here. We see more and more products now that offer, for example, a SQL interface and then also a key-value store, and it's clear what's happening, right?
They are just adding a new API layer and reusing everything below it. It's relatively easy for them to do that; maybe they don't always get all the optimizations they could, but it's an easy way to add additional functionality to your stateful workload. Yep, exactly. Here is a table showing some of the workloads that we have analyzed and the choices they make in this space: the shared state consensus protocol used between replicas, and the unshared state consensus protocol, if any. I would highly recommend that if you are considering a new stateful workload, you ask your vendor or your experts what choices they make here, because that already tells you a lot about the software you're about to purchase. Now some considerations around strong consistency versus eventual consistency. Both can be approaches to cloud-native DR, but they behave differently. For example, in terms of RPO, strong consistency is about consistency, obviously, so we don't lose any data once we create a well-designed deployment: the RPO is exactly zero. I've met people who couldn't believe it, but it is exactly zero; you never lose data with this, obviously assuming only one disaster at a time. With eventual consistency you may lose some data, theoretically an unbounded amount, but in practice, if the system is not overloaded, it's close to zero, because what is lost is just what was in the local cache that the system didn't have time to replicate. Another thing to consider is that when you lose one data center, one failure domain, in an eventually consistent system, the rest keeps serving, as I said before. So the two sides of the deployment may diverge in terms of data, and when the lost side comes back they don't necessarily agree on the state, so a reconciliation algorithm decides who is right. But that reconciliation algorithm may not reason the same way your application reasons from a business perspective. What I like to say is that eventual consistency does not mean eventual correctness in business logic terms, and eventual consistency may impose additional design considerations on your developers. If it's at all possible, I personally prefer to keep things simple for the developers and choose a strongly consistent deployment. I think strongly consistent is the most predictable, and one of the points here about the minimum number of failure domains is that with strongly consistent systems we effectively have an odd number of copies. Remember when we were talking about the CAP theorem and about partitions: if a node is partitioned or unavailable, the remainder of the system still has a majority of the copies of the data and can therefore make an automated decision as to who is up, who is down, and which systems are authoritative. Whereas in eventually consistent environments it can be a bit more complicated, because some of those decisions are delegated to the application, or to reconciliation processes, which are not perfect, hence the reference to it not meaning eventual correctness. In terms of RTO, both will react in a few seconds, essentially depending on the health checks you set on the global load balancers and on the internal health checks the system has.
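A tiny Python sketch of the arithmetic behind that "odd number of copies" point, under the usual majority-quorum assumption: a strongly consistent system needs 2f+1 failure domains to keep a majority after losing f of them, which is why three data centers are the minimum, while an eventually consistent deployment can run with two because each side accepts writes independently. This is illustrative reasoning, not tied to a specific product.

```python
# Why strongly consistent deployments need at least three failure domains.
def min_domains_strong(failures_to_tolerate: int) -> int:
    """Majority quorum: 2f+1 domains survive the loss of f domains."""
    return 2 * failures_to_tolerate + 1

def still_has_quorum(total: int, lost: int) -> bool:
    """True if the surviving domains still form a strict majority."""
    return (total - lost) > total // 2

print(min_domains_strong(1))       # 3 -> two data centers are not enough
print(still_has_quorum(3, 1))      # True: 2 of 3 can still elect a leader and commit
print(still_has_quorum(2, 1))      # False: 1 of 2 cannot form a majority
```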
In terms of latency, strongly consistent workloads have a strong sensitivity to the latency between these failure domains, which could be regions or data centers. By and large, your latency will always be greater than two times the worst latency between your regions, because it's always a round trip, and that sets your expectation. To me that says I cannot use strong consistency for all use cases: I will have use cases that need really fast responses where I can't use this kind of system, so I need to find different solutions. But if the latency I get is acceptable, then, like I said, strong consistency keeps things predictable and simple for the developers. Eventually consistent systems, on the other hand, are not affected by that latency, because essentially the system first writes locally and returns, and then it tries to synchronize with the rest of the instances, so the client latency is not affected; that's a simplification, but it's a way to explain why they are not really affected by inter-failure-domain latency. Throughput, for both, can scale linearly: as long as the requests are spread across all the partitions more or less evenly, these systems scale linearly with the number of instances. If you want more throughput, just add more instances, increase the number of partitions, and you get the throughput you want. Then, as Alex said, strong consistency has another constraint that some of our customers find taxing: you need three failure domains. In other words, if the failure domain is the data center, as we were saying, you need three data centers, and we know perfectly well that some enterprises have two data centers, maybe in the same metro area with very good latency between them, but they don't have a third one. So what can they do? There are solutions to get a third data center; one option is to go to the cloud. But that is certainly a constraint of strongly consistent systems, whereas eventually consistent workloads only require two failure domains. We also wanted to share reference architectures for Kubernetes deployments. The first one we shared was very generic, but obviously we are looking at Kubernetes with special attention here. It's not very different: we still rely on the stateful workload to do the horizontal sync, we still rely on persistent volumes, provided by Kubernetes in this case, and then we have some ingress and a global load balancer in front that decides where the traffic goes. Here, when we lose one site, essentially nothing happens: the global load balancer should realize it and just send the traffic to the other ones. Another question you should ask yourself when you build this architecture is: what happens if the clients can connect to my workloads, but there is a network partition between some of the data centers, so one of them is isolated? For strongly consistent workloads this is essentially equivalent to losing that data center, because all of its instances become inactive, since they don't have quorum, and so the global load balancer should realize that that site is not responding and send all the traffic to the remaining two; the behavior should be exactly the same. We also have a reference architecture for eventual consistency, which is similar, right?
Except that you just need two failure domains, or two data centers. Here the conversation is slightly different. Again, if you lose the entire data center, the global load balancer can only send you to the remaining one, and there is no real state divergence, because there are no writes on the data center you have lost. It's a different story when you lose connectivity between the stateful workload instances but the clients can still write: in that case you can have divergence of the state, and that's the conversation we were having before. This is a situation where at some point the fault will be corrected, connectivity will be reestablished, and the reconciliation algorithm will kick in, but the resulting final reconciled state is not necessarily aligned with what you need in your application. Okay, we have some reference material here if you want to go a little deeper: this is our white paper and some blog posts about building these architectures in practice, and if we have time left we can explore one of these environments together. So if you think you can do it in ten minutes, I'd love to; I think we would all love to see the demo environment, but do you think ten minutes is enough? It would just be an exploration of what we have, unless we have questions; if we have questions, I'd also like to answer those. I have a quick question, following from what you were talking about last. Obviously one of the biggest considerations between strongly consistent and eventually consistent is the cost factor, right? So as you're working through this in TAG Storage, are you looking at TCO, or have you explored that in TAG Storage overall? I don't focus on cost, because we try to be product agnostic, so my only consideration is the third data center. That may be a significant cost, depending on how you decide to implement that data center: if you go to the cloud it's not actually a huge cost, it depends on how much you deploy, it's a pay-as-you-go model, but building a physical third data center is another matter. Another consideration is that some companies may be running on software that is not capable of being deployed this way, and so they may be facing a migration: you may have to migrate from, let's say, MySQL to a modern database that can actually be deployed the way I was describing, and those migration projects can create some cost in an enterprise. I mean, I kind of knew the answer, but I was trying to tease out that strong consistency is obviously the best choice. All right, dive into your demo, and keep asking questions. I'm going to describe this environment. Here I have three clusters, okay? These represent my three regional data centers; they are in Google Cloud and they are in different regions, as you can see: US east, US central, and US west. Here, for example, I have one of these clusters. Obviously I'm from Red Hat, so I'm using OpenShift because it's easy for me, but there's nothing in what we discuss that implies OpenShift; you can do everything with Kubernetes. Here I have Kafka deployed: you see three instances of Kafka here, and this is the second cluster with three more instances; these are all talking to each other. And I have, for example, a Kafka console here in which I can see that I really have, if I go here, a nine-node Kafka cluster. You can see from the names of the instances of these nodes that I have cluster three, cluster one, cluster two.
So all of the instances that are distributed across the different OpenShift clusters come together to create a single logical Kafka instance. If you notice this notation here, let me make it a little bigger: this is not the usual service name that you get inside OpenShift, this is a new standard for inter-cluster services and endpoints, so it uses this clusterset notation and it also includes the name of the cluster in which the pod runs; these are actually the pods we were seeing before. I think I have a queue topic defined here, and if we look at this topic, look at the partitions: it has nine partitions, so it's well balanced across the available nodes of these clusters. And I configured Kafka to be strongly consistent, so each partition has three replicas and each of those replicas is in a different region. If I lose a region, one of those replicas will go red, but I still have two replicas and I can still continue working. So that's one way you can set up your workload. In this experiment I also have a CockroachDB deployment, and it gives you a UI where you can go and see what you have. It has this nice feature where you can see how your cluster is distributed on a map: these are the three regions that I'm using from Google across North America, and these are the nodes. In this case as well I'm using nine nodes, three in each region, and each node within a region is in a different AZ, so I'm trying to get both local availability and global geographical availability. Another nice feature of CockroachDB that I like is that it calculates the latency between the instances and then uses this information for some internal optimization. As you can see, this is a mostly symmetric matrix, and from west to east is where we have the highest latency, around 60 milliseconds. Based on what I said before, that tells us that the latency we can expect from a transaction here, once we have distributed the data across the three regions, is going to be something above 120 milliseconds at best: we get 120 milliseconds just from the network, and then there is processing, writing, persisting, and all of that. So you can immediately start reasoning about what kinds of workloads fit this database. It's a trade-off: your workload may go a little slower than if it were deployed in a single region, but in exchange, when you lose a region you essentially don't have to do anything, the service continues to be up and the system keeps working. That's it; I think we don't have time to simulate a disaster and see how the system reacts, but we would see that both Kafka and CockroachDB can automatically detect failures and start reacting to them: we would see that we lose some nodes, we would see that some of the ranges, which are essentially the partitions, get moved around, and the system, like I said, adjusts to the new situation. I think one of the key takeaways here is that in a cloud-native world we now have the capability of implementing disaster recovery with different storage systems, different databases, and different tools, and you actually get an order of magnitude better automation and flexibility than you do with traditional systems. I think this is the next logical step for many enterprises as they adopt cloud-native technologies, Kubernetes, OpenShift, and cloud-native storage solutions, and as they look to
migrate more mature and more mission-critical workloads that require disaster recovery. So my key takeaway here is: understand the different layers in your system, understand the different attributes, like the latency, performance, and consistency requirements of your applications, and then absolutely take advantage of the composability and the declarative nature of cloud-native disaster recovery and all the advantages that brings to your application, and, touch wood, sleep better at night. Thank you. Thank you, and we had a question come in from the LinkedIn event that I think we'll have to take to the LinkedIn chat if it needs longer than a minute, and it's a great question: what considerations are required for cloud-native disaster recovery in a heterogeneous environment? If either one of you wants to take it in one minute, otherwise we can push it to the LinkedIn chat to answer. A quick one, go ahead. Yeah, I assume that by heterogeneous we mean that we don't have a homogeneous cloud provider or infrastructure underneath, right? Like I said, this architecture relies on capabilities in the networking space, so as long as we can do east-west communication and we can discover the instances of our stateful workload in the remote failure domains, and as long as we can set up some level of global load balancing, we will be able to create these architectures, and in fact we are doing it. What we're noticing, collaborating with some of these vendors, is that in order to be predictable in these deployments you would like all of the instances to behave the same, but how do you, for example, provision the same IOPS across different cloud providers? They all offer that capability or that SLA in a different way. Or how do you provision the same computing power? They are all slightly different. Those are the things you may encounter, but there is no actual blocker to building these architectures across heterogeneous environments. In fact, and I'll just take one more second here, I would argue that tools like Kubernetes are actually designed to abstract your infrastructure and to give developers the capability of getting the same services from different systems, and I think the glue that holds it together is the east-west networking and the global load balancing services, which sit on top of that abstraction. Thank you both for taking that last-minute question; it's a very good and very important one. Thank you again. I will post the video recording and the slides to the LinkedIn group and to our LinkedIn event, and then we will see you all next time, next Tuesday; we'll see if we have something scheduled, but see you all then. Thanks again, Alex and Raffaele, this was a great and important topic. Thanks for having us. Thanks for having us, thanks everyone, bye bye.