Okay, let's get started. Welcome everyone, thank you very much for joining us today. Welcome to today's CNCF webinar, Effective Disaster Recovery Strategies for Kubernetes. I'm Jerry Fallon and I will be moderating today's webinar. I'd like to welcome our presenter today, Rashid Amir, CEO at Stakater AB. Just a few housekeeping items before we get started. During the webinar, you are not able to talk as an attendee. There's a Q&A box at the bottom of your screen, so please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct, and please be respectful of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io slash webinars. And with that, I'll hand it over to Rashid for today's presentation.

Okay, thank you very much, Jerry. Hi, everyone. I don't know what time it is in all the places you're joining from; I'm in Stockholm, Sweden right now, and it's almost four o'clock in the afternoon here. So I welcome everyone to today's session about effective disaster recovery strategies for Kubernetes. At Stakater, we have been lucky to work with Kubernetes since 2015, and I say lucky in the sense that it's a really awesome technology that we got a chance to start working on early. Since then we have worked on many different projects, applications and assignments with customers, and we have learned a lot about how you can actually build disaster recovery strategies. We have learned the hard way: we have failed, and then tried to figure out what the possible strategies are and what their pros and cons are. In this presentation, I will walk through them one by one and present what we learned from them.

Let's see, I hope you can see my slides. So, I am CEO and co-founder at Stakater, and as I mentioned, I am a developer by heart and fell in love with Kubernetes. Stakater, just as a quick introduction, is a CNCF certified Kubernetes solution provider. We are a Kubernetes enablement company, helping companies realize the full potential of Kubernetes and its ecosystem, from strategy through development and operations, and helping all the different dimensions within an organization take complete benefit from Kubernetes. I don't think I need to talk about Kubernetes itself, as most people who are here must be using it already. Kubernetes is the leading container orchestration platform, and given its beauty of extensibility, it's really awesome to work with.

So, talking about today's challenges for high availability and disaster recovery: what are the challenges today? Compared to the past, things have changed quite a lot with respect to high availability and disaster recovery. If you look at today's requirements from CIOs, they want zero outages, they want to react quickly to demand, they want to scale out, and of course they want 24/7, 365 availability. That in turn means that to implement those kinds of requirements, you need things like DevOps, CI/CD, HA architectures, rolling upgrades, redundancy and so on.
And that of course leads into having continuous service delivery. So the question is, what is business continuity? Business continuity is all about maintaining the critical business functions during and after a disaster has occurred. Business continuity defines two main criteria: the recovery point objective and the recovery time objective. RPO amounts to how much data loss is tolerable, and RTO to how quickly services can be restored when a disaster occurs. Disaster recovery outlines the processes as well as the technology for how an organization responds to a disaster, so disaster recovery can be viewed as the implementation of RPO and RTO. I will show later on what exactly RPO and RTO are when a disaster happens. Business continuity is basically: first you have a backup, and then when a disaster happens you are able to recover from that backup, based on what RPO and RTO you have.

And why is business continuity important? When your business depends on the uptime of your services and applications, it is critical to have a comprehensive data protection and disaster recovery strategy. Regardless of whether the workloads are running on premises or in the cloud, disruptions can cost your business thousands of dollars each minute in lost revenue, decreased worker productivity, and damage to your brand loyalty and reputation. Disruptions can take many forms: user error, misconfiguration, and so on.

But the question is: okay, we want to have business continuity, but what are the challenges to implement it? You have to look into things like efficiency, because DR requires regular testing, and in the event of a disaster resources must be available; this sometimes leads to having resources sitting idle 99% of the time. Then of course complexity: updating applications is complex enough, but DR requires a complete redeployment, where the DR site almost never mirrors production due to cost. And of course cost: DR usually at least doubles the price. So depending on how you are doing things, these are the different challenges that you have to keep in mind, along with your RPO and RTO, to figure out what your DR strategy will be.

Looking into highly available architectures: what is high availability? A single-site application is traditional, where you have no disaster recovery; that still works and is an option. The second one is the DR application, where you can fail over to your recovery site; that is also a traditional architecture, where you have two sites and on failover you move to the other site. And then you have the option called a multi-site application, which is more cloud native: you deploy across multiple sites, disaster recovery is built in, and you simply rescale. It's recommended to have three sites rather than two. So that is what high availability looks like.

Then the question is: what are the general expectations from our IT infrastructure? That we have availability and uptime for our apps. That's what we are expecting from our Kubernetes environment as well, so that everyone in the organization is super happy. But eventually something may go wrong, and we have seen in practice that it does happen and things go wrong.
And then of course you become a little unhappy, and this is the time for what we call disaster recovery. The disaster has happened, and now we talk about what disaster recovery is. As I briefly touched upon, there are two metrics: one is the recovery point objective and the second is the recovery time objective. To put it in perspective: if this line is the timeline, and you are taking backups, let's say every ten hours or every five hours, and a disaster happens, then you will have downtime until you become green again, and that is your recovery time objective. From the time the disaster happened, when were you back again? Do we want to be back in ten minutes, one hour, two hours, six hours? That depends on what the business has defined: how much does the downtime cost them? And the second thing is data loss; that is the recovery point objective. How much data loss can you bear? One hour, five minutes, ten minutes? These are the metrics you have to keep in mind, and based on these metrics you will define your DR strategy. So for whatever options I present afterwards, you have to always keep these two values in mind.

If you go back to the business and say we need a DR strategy, you will have to start your discussion with these two metrics, because once the business, the C-level executives or whoever the decision makers are, define these metrics, then you can propose what kind of DR strategy will actually meet those objectives. Of course, solutions are available now where live replication is happening and the data loss is almost zero, or very minimal, and you can have almost zero downtime. All of these things are possible, but again, it depends on the metrics the customer has defined and wants to have.

Before the container era, things were much, much simpler, because you had a one-to-one mapping between application servers and applications. So what you could really do was take a backup of everything on the server: I have a VM, let's take a snapshot, and if a failure happens I just start a new server or a new application from the same virtual machine image or snapshot that I have, and it works perfectly fine. Things were pretty easy, before we went into containers and started doing microservices and things like that.

So what are the traditional DR approaches that you might have seen, and that are still being used as well? Approach number one is a cold standby: on the left side you have an active site, on the right side you have a standby site. Or you can have a warm standby. The difference between the two is the replication: with a cold standby you are taking periodic backups and restoring them, while a warm standby means more like batch or continuous replication is happening, so you are copying things over from one side to the other in some way. The application and data are unavailable until the standby is brought online, and data loss is highly likely in this scenario, because you are either taking periodic backups or doing batch, or maybe continuous, replication, but again there will be some batching going on. Approach number two is similar to that one: you have continuous replication, and in this case you call it a hot standby.
And then the other option is a hot standby with read replicas. In this case you have continuous replication going on and you have a hot standby, but with read replicas: writes go only to the active site, but reads can be served from the standby. So that's another strategy you can have. And the third approach is that you rely on the underlying technologies to help you out with real-time data integration and replication. If you are using any of those kinds of technologies, then of course you can rely on them to build up your disaster recovery solution. In that case you will have minimal data loss, because you are relying on your underlying VM solution or your storage solution to replicate your data and replicate your virtual machines in case a disaster happens. So those are the traditional approaches. Going into the container era, we will borrow from these traditional approaches; they are still being used underneath. If you go to cloud environments, when you buy these services, underneath they are still using a similar set of technologies, but it's good to know the basics and the different ways of doing it.

So when we enter the containerization world, and in case you're not familiar with Kubernetes: Kubernetes has changed the way you deploy applications, because you no longer have a relationship between a fixed server and an application. It's more like you have nodes, and your application will run on any node; we don't know which node in your cluster your application will run on. Kubernetes is the orchestration platform: you give it an application and it figures out for you which node has the capacity to run your workload. You don't have to bother about saying run here or run there; it just picks a node and runs it. Of course, you can have node pools and say run only on this kind of node, that is possible, but there is no one-to-one mapping anymore. So you have an even more dynamic architecture now compared to the pre-container era.

Then it comes down to Kubernetes backup and recovery. Why backup? Of course, you could have a natural or man-made disaster, it could be human error, it could be hackers or malware, or you could have legal standards or compliance regulations that require you to make backups. So even if you are using Kubernetes, you need to think about your backup and recovery strategy, your DR strategy, and the reasons can be many. And then the question comes: what do I need to back up when I'm running Kubernetes? Here is a list of the things you at least need to back up: your namespace configurations; your Docker images, if they are not stored outside somewhere; your deployment configurations and everything around them, like config maps and routes, anything related to the applications you are deploying to your cluster; your operators, anything that you are deploying, you need to have a backup of it. And then of course your access controls, how people access the cluster: if you have RBAC policies defined, they need to be backed up.
Then of course another important thing is the certificates that are being used for your routes and similar things; those need to be properly backed up as well. And the last key item is the persistent volumes, if you are running stateful applications within your Kubernetes cluster. So you have to back up all of this, and if you look at this data, you can divide it into three things, basically: the volumes and the data; what I call static configurations, like certificates and similar things; and the things you put inside the Kubernetes cluster, like manifests, including namespaces, deployments, config maps, secrets and so on.

If you look at it that way, we can divide Kubernetes into two sorts of components. One is what we call stateful components and the other is what we call stateless components. When we talk about stateful components, you have two things. One is etcd: this is where the Kubernetes cluster state is stored, it's the brain of Kubernetes, and it holds all the application configurations. And the second thing is the persistent volumes: if you are running stateful workloads, you will have persistent volumes. You need to think about these two components when you are thinking about your DR strategy: how am I going to keep a backup of my etcd, and how am I going to keep a backup of my persistent volumes? Those are the two questions you always have to answer. And then of course you have a lot of stateless components, which are the rest of the Kubernetes control plane, the worker node components, and the stateless workloads. You have lots of them as well, but you don't have to bother much about the stateless components, because what's running on your Kubernetes cluster is already described inside your etcd: if you have a backup of etcd, you can restore from it. But for the data and the volumes, you have to really think, so the stateful components are where you have to focus a lot and make a real strategy.

So, a brief look at what etcd is, how to back it up and how to restore it. etcd is a key-value store; it's the brain, the database of Kubernetes, where everything is stored. It's a consistent and highly available store that holds all the cluster state and data, all the Kubernetes objects in the cluster. Because if you look at it, what is Kubernetes? Kubernetes is a really awesome, well-defined REST API that you can interact with and ask: create me a deployment, create me a config map, delete this, do this. It's a well-defined set of REST API endpoints over proper entities, which are maintained via those APIs.

And if you look at how to back up etcd, you have multiple options. One option is to use the built-in snapshot feature of etcd to take backups; that creates a snapshot file. The second is that you can take a snapshot of the storage volume: whether you are running in the cloud or locally, you can just take a volume snapshot as well. And the third option is that you back up the Kubernetes objects and resources.
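To make the first option a bit more concrete, here is a minimal sketch of what the built-in snapshot workflow can look like. It assumes etcdctl v3 and kubeadm-style certificate paths, which are assumptions you would adjust for your own distribution:

```
# Take a snapshot of etcd (endpoint and certificate paths are assumptions)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify that the snapshot file was written correctly
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db

# Later, restore into a fresh data directory; your distribution's docs
# describe how to point the etcd members at the restored directory
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored
```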
That third option means you just export all of the API objects, put them somewhere, and later on you can import them back. So there are three major options for backing up etcd. And depending on how you actually want to do your recovery, you then follow the documentation of the distribution you use and how they suggest you recover your etcd quorum. Different Kubernetes distributions have different documented procedures for recovery. So backup is one thing; for recovery you have to follow some manual steps to recover your etcd quorum so that your control plane becomes operational again. Just to explain: if you have a running cluster and your control plane becomes unresponsive or goes down, it will not affect your running applications as such, because they will still be accessible, but you will not be able to create new applications or deploy new things. Then you have to restore etcd to bring things back. And to restore etcd, as I mentioned, you can restore from the snapshot, or from the volume, or, if you were using the third option and have a backup of the objects, you can recover that way as well.

Now, the second stateful component we talked about was persistent volumes. Persistent volumes you can put into two buckets: CSI (Container Storage Interface) based volumes and non-CSI volumes. CSI volumes are not that new, but a bit newer compared to non-CSI volumes. If you are using a CSI volume, then creating a snapshot of a persistent volume is just creating another manifest, and that's it: you can say, create a snapshot from this volume, and it will create a snapshot for you. Then of course you can define your strategy for how often it should take the snapshot and so on. But you have to keep in mind to check whether the CSI driver you are using actually implements the volume snapshot functionality underneath. If it does, then CSI-based volumes are a really good way, because it's easier to define, and you can put your snapshot strategy as code as well. And restore is as simple as creating a new persistent volume claim and saying: the data source is my snapshot. Then it is restored from that.

If you have non-CSI based volumes, then you have to look at the vendor: vendor-specific backup and restore, or you can look into the open source solutions that are available if you want to take a look at them. So it doesn't mean that without CSI-based volumes you cannot do backup and recovery for volumes; you still can, and there are open source solutions. Otherwise, you can use the underlying solution available from whatever technology you are using in your enterprise, be it a data storage technology or something else; they have options for backup and restore. And if you are on public clouds, they have built-in snapshot capabilities that you can use for your volumes as well. Otherwise, if you have CSI volumes, then, as I mentioned, you can use those.
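As a sketch of that CSI route: taking a snapshot and restoring from it really is just two manifests. This assumes your CSI driver implements snapshots and that a VolumeSnapshotClass named csi-snapclass exists; all the names here are illustrative:

```yaml
# Take a snapshot of an existing PVC (all names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: app-data
---
# Restore: a new PVC that uses the snapshot as its data source
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-restored
spec:
  dataSource:
    name: app-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```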
Otherwise, you have the open source solutions. So now we have understood that we have to back up etcd and we have to back up our volumes. Now let's talk about the different restore strategies we have for our platform operations. In our experience, you can divide them into two kinds of strategies: one is rebuild, and the other is repair. Either you repair your cluster or you rebuild your cluster. And again, you have to go back and look at your RPO and RTO to decide which strategy to go with. So you can have what we call platform backup and restore; you can restore VMs from a snapshot; you can fail over to another cluster or to another site; and you can rebuild from scratch with GitOps. In our experience we have used these different strategies and learned from them. Each has its own pros and cons, and some are of course better than others, but it all depends on your RPO and RTO and how much money you have to spend on your backup and recovery strategy, your DR strategy.

As the name says, the first strategy is platform backup and restore. On the left side you have your source cluster, on the right side you have your target cluster, and you are running some backup tool which is taking a backup of your applications (that is, etcd, basically) plus your persistent volumes and your images, and it stores that information in some repository. Everything is backed up and put there, and then on your target cluster you can use the same backup tool to restore from it. For example, Velero could be the tool; it's open source, or you can use others out there as well. You back up your cluster resources and volumes for the entire cluster, or you can back up just a part of the cluster, and you can also choose to schedule this backup. So this can be a good catch-all solution to back up a large part of the entire cluster. Of course, you will require adequate storage in the repository where you are putting the backups, and when needed you can restore the entire cluster to its working state in a new instance: you start a new cluster and then restore everything from the backups you have there. That's one strategy, and it's used quite commonly as well.

Strategy number two is restoring the VMs from snapshots. This strategy is about your masters, your etcd; we're talking about control plane recovery. If you are taking volume backups of etcd, then you can restore a new cluster from those volume snapshots and then add new worker nodes. For the volume backups you can follow the same strategy, but in that case you will have to back up your volumes and bring them back on the new cluster. So if the underlying infrastructure has built-in support for volume snapshots, then you will just need to recover the etcd quorum from your snapshots and bring the cluster back.
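Coming back to that first strategy for a moment, a minimal sketch of what the Velero workflow can look like follows; the backup names, namespace and schedule here are illustrative, not a definitive setup:

```
# Back up the whole cluster (Velero includes all namespaces by default)
velero backup create full-cluster

# Or back up just part of the cluster, on a schedule (daily at 02:00)
velero schedule create apps-nightly \
  --schedule "0 2 * * *" --include-namespaces my-apps

# On the target cluster, restore from a backup
velero restore create --from-backup full-cluster
```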
And again, as I mentioned, in our experience of working with different Kubernetes distributions, everyone has their own manual procedures for how you recover your etcd quorum. So in that case, you have to follow that procedure to recover your cluster and bring the control plane back.

The third strategy is to fail over to another site. What is a failover? In case one cluster fails, you use a failover cluster. Both clusters are nearly identical: the infrastructure is identical, the stateless applications are identical. The configuration could be different, but the clusters are kept in sync with parallel CI/CD. So this is the setup where you have two data centers running two clusters, you are deploying to both, but one of them is standing by while the other runs active. In this case CI/CD can solve the deployment part, but for the data, if you are running stateful applications, you have to rely on the underlying technology to ensure that the data is replicated from the active site to the standby site as well. In case the active site goes out of service, the standby can take over, and if you have stateful applications, they should see the same volumes and data there. So that's another strategy. Again, looking at your RPO and RTO, you definitely have to consider this one from the cost perspective, because it roughly doubles your cost: you have to run two clusters in parallel and maintain two clusters in parallel.

Then another strategy that is very common now, the kind you usually see in a public cloud setup, is that you have one cluster which is stretched over multiple sites, and you can do the same thing for on-prem clusters as well. You have a cluster which is running, spanning, across multiple data centers. In that case, it's always recommended that you run not an even but an odd number of sites: not two sites, but at least three, because etcd needs quorum, a quorum of three, and it's better to have three sites. If you lose one site, the majority of the quorum will still be available and your cluster will be operational. So in this case you try to build a cluster across multiple sites. For example, let's say you're doing it on-prem: then you have multiple on-prem sites where the data is being replicated. In our experience this strategy works really well too. What you have to keep in mind is that you need enough capacity on the other sites if one of the sites goes down, so this goes back to your capacity planning. But it is of course better than having two separate clusters, one active and one not active. In this case the whole cluster is stretched over, let's say, two or three sites, and you are also relying on the underlying infrastructure, which is very important: you rely on the underlying infrastructure to replicate your volumes and all your data.
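This next bit isn't from the slides, but as an illustration of the Kubernetes side of a stretched cluster: one way to make sure application replicas actually land on different sites is topology spread constraints, sketched here with the standard well-known zone label (the deployment itself is hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        # Keep the per-site (per-zone) replica counts within 1 of each other,
        # so losing one site never takes out most of the replicas
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25
```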
So if you lose one site, you would expect your VM technology, whatever you are using, and your data storage technology, whatever you are using, to ensure that the data is replicated to all the sites and that virtual machines can move when one site goes down. We have done experiments with this setup with customers who had to test out their DR strategies, where you take out one of the sites, and the underlying technology moves things over quite seamlessly. You don't see any difference: okay, I lost a site, but since the other sites had enough capacity, things kept running.

But one very important thing: if you are running multi-site with, let's say, two sites, then things can get a bit trickier, because if you lose the site which is running two masters, the rest of your cluster will become read-only. etcd will have no leader left and it won't be able to entertain any new requests, and it usually isn't even able to redeploy your workloads, unless you have underlying technology that will bring those two masters from the broken site back up on your other site. As long as their IPs and hostnames stay the same, Kubernetes will not notice that a master has moved from one place to another, because it's seamless. So you can definitely have an underlying VM technology, virtualized over multiple sites, that handles that kind of replication for you so you don't have to bother, and similarly you have to be aware of this for data storage. We have used this strategy quite a lot and it works pretty well.

And recently we have been helping out with setups where you have two on-prem sites and a third site, where the third site is the cloud. That's quite a common scenario as well: you add a third site which is your cloud, and then you can scale out into the cloud on demand. That's quite a common pattern and a question coming up these days, where customers don't want to buy more hardware but rather use the cloud as a backup when they have scale-out demand. With Kubernetes, it works out pretty nicely; it's a really good way of expanding into the cloud.

And then we come to the last strategy, the rebuild strategy, and this is a really nice one. If you are familiar with Kubernetes, you might have heard about GitOps, which is getting very famous these days. The idea is that Git is a single source of truth for your declarative infrastructure: your entire system is described declaratively, everything is in Git, you can see Git diffs, you can roll back and do all that stuff, which is pretty nice. So the idea with this is: rather than repairing, why not rebuild the cluster? That works pretty well too. If you have complete automation, rebuilding is often faster than actually repairing a cluster. But again, you have to keep in mind your RPO, your RTO and your persistent volumes. When you talk about GitOps, the whole idea is that the cluster state is in a Git repo.
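As one concrete flavor of that idea, here is a minimal sketch of pointing a cluster at a Git repo using Flux's v2 GitRepository and Kustomization resources; the repo URL, namespace and paths are illustrative assumptions:

```yaml
# Tell Flux which repo holds the cluster's desired state
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/my-org/platform-gitops
  ref:
    branch: main
---
# Continuously apply (and prune) the manifests under ./clusters/prod
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters/prod
  prune: true
```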
So you don't need to take an etcd backup anymore, because whatever you deploy to the cluster is coming from a Git repo. It means that if I lose a cluster completely, I don't have to worry, because I can bring up a new cluster, apply the same Git repo there, and it will have exactly the same things running. For stateless applications this works perfectly and you can do it very quickly; the RPO and RTO are amazing. But when you combine this with persistent data, because now, with the help of operators, you see more and more workloads running on containers, even databases and message queues, then you have to combine this with your storage backup and recovery strategy as well. Of course, you can think of having your etcd contents effectively captured in a GitOps repo, but then you need the mentality of what we call everything as code. Your infrastructure has to be as code, which includes cloud configurations, VM configurations, even the Kubernetes cluster setup, plus your deployments, your tools, your application configurations. Everything has to be in code.

So with cloud infrastructure as code, you basically have a pipeline. The pipeline first brings up your cloud infrastructure: compute, storage and security. The next step brings up your Kubernetes cluster on that cloud infrastructure. The third step deploys what we call the supporting tools, maybe Prometheus or whatever you're using, EFK; it brings all of those into your cluster. And the last step deploys your applications. Everything is inside the code, in the repo, and I can reproduce my whole cluster from code. As I mentioned, this works perfectly fine as long as I have stateless things. For stateful applications, you can bring up the new cluster with everything there and then just restore your volumes for all the stateful applications you are running. We have experimented with this and used it, and this strategy also works pretty nicely, but of course you have to be a bit more prepared for how you run it.

So if I try to summarize what we looked into: one thing that is very important is to plan ahead. Review your disaster recovery requirements: what is the tolerance level of data and service loss that you can bear, what should your RPO and RTO be? Make sure you plan ahead. Then, once you have those metrics, look into your workloads. What workloads am I running? Am I just running stateless workloads? Okay, then I can follow this strategy. Or am I running not only stateless but also stateful workloads? Then I have to think about where I am running them: is it on-prem, what underlying technologies am I using, how do they support me in backing up my data? And you plan accordingly. Then you create backup and recovery procedures that best meet your specific requirements: look at your requirements and see which of the different strategies I mentioned fits your needs best. And this is very important: don't just plan and document. Make sure you also practice this for real, and this practice should not be a one-time thing.
This practice has to be continuous. You should think of it in a way where, for example, with our own environments where we manage OpenShift or Kubernetes for customers, we do this a lot: for DevTest environments or DevTest clusters, we actually destroy them every night, we delete them every night, and we have a procedure in place so that the next day in the morning there is a new cluster with everything brought back to the same state. So continuous testing is happening. It's not once in a blue moon that you test, because then your procedures get outdated very quickly: a new technology gets introduced, no one looks into it, and when you finally try it, things fail. So it has to be a continuous part of your ongoing day-to-day workflow; that's very important. And you can do this kind of thing in your DevTest environment. Usually you have multiple clusters, one DevTest and one production, and if they are on the same Kubernetes version, you can test everything on DevTest continuously and verify it.

Then of course you can look into what we call chaos engineering tools, like Chaos Monkey and that kind of thing. What we do a lot is, if we're running on public clouds, we use spot instances. Spot instances are a built-in chaos monkey for us, because you lose a spot instance and things have to recover. So you continuously have, I would say, built-in testing happening. Of course we don't run masters on spot instances; we only try to run the worker nodes on spot instances. And that way recovery is continuously tested as a built-in part of operations.

And then it's very important that you consider failures at every layer. Remember I mentioned that your Kubernetes disaster recovery strategy might rely on your underlying infrastructure, your VM technology and your data storage technology. So what if they fail? You always have to think through all the possible layers you have in your infrastructure and then define your DR strategy, because your storage layer can fail: it might fail to do the backup, it might fail to help you restore. Similarly, things can happen at your VM layer: your virtual machines can fail and might not be able to recover. So look into that very thoroughly and test as well: masters, workers and of course storage.

So now I am done with the presentation, and I would say thank you very much, everyone, for listening to me today. We are very passionate about Kubernetes, so I hope you have questions. I see there are a few questions out there already. Let's see if I can answer them or not, but I will try my best. So, over to you.

Yeah, thank you very much Rashid for that wonderful presentation. We do have a few questions in here already, but we will have about 10 minutes for Q&A, so if anyone else has any questions, please feel free to drop them in. First question: if I'm running my K8s cluster in the cloud, let's say AWS, and I take a snapshot of the etcd volumes mounted to the master VMs, as the internal IPs are hard-coded into the etcd data volumes, would I be able to restore the cluster state in a newly built cluster? This cluster will of course have a new set of internal IPs.

Yeah, right. So I think in that case, it's not the etcd volumes I would go with.
I would prefer in that case the built-in etcd snapshots, not the snapshots of the volume. Because if you have snapshots of the volume, and things like IPs and hostnames are hard-coded into the setup, then it doesn't work, you know, because, as I mentioned, if you want etcd to recover from a state where you have lost a few members of the quorum but not everything else, the members need to have the same hostnames and IPs. If they don't have the same hostnames and IPs, you can still recover, but then you have to do a lot of manual work in between to remove nodes from your etcd cluster and add them back. But in your case, if you're running on a cloud like AWS, I would rather go for a snapshot of etcd itself, not the volumes, because it will be much faster to recover from the etcd snapshots. That's how I would do it.

Okay. What about secrets in the GitOps approach? Yeah, that's another billion-dollar question. Here you have multiple options for how you handle secrets if you're using the GitOps approach, and we use different approaches. One of the solutions we use quite a lot is Sealed Secrets. It works quite well, but in that case you have to seal your secret for each cluster; that's how it works. We have also been using the open source HashiCorp Vault, and that also works pretty nicely. And then we use things like gopass for the shared secrets, if the developers need to know those secrets. But we never recommend putting plain, unencrypted secrets out into your GitOps repo. That's never recommended and you should never do it. So if you want a lean start, Sealed Secrets is a good option: it works pretty nicely and fits very nicely into your GitOps workflow. But that really doesn't answer the question "is my secret secure now?" No, Sealed Secrets doesn't fully secure your secret, because I can still get into the cluster and read the Secret there. So if you're really conscious about your secrets, you have to think about a solution that provides secrets on the fly and puts them in a location that you, as a human user, can't get into and read from. I hope that answers the question.

What are your DR recommendations for stateful applications, databases and messaging systems running on K8s worker nodes? Okay, and reading further: if we are in a public cloud, would you recommend multiple availability zones within a region, or going across regions? So if I'm running in a public cloud, within a region I would go with multiple availability zones; I will run my cluster across multiple availability zones. That's one thing I would do. I mean, I don't need to go multi-region, because multiple availability zones are perfectly fine in that case. And if you are running stateful workloads, as I mentioned, you can look into open source tools like Velero. It's a pretty nice open source tool that can take snapshots of your volumes, and then you can use those snapshots to recover your system, your stateful applications. I think that is a really good way to recover your stateful applications, and it's what we have been using quite a lot as well. You can also look into the volume snapshots that you get from the cloud provider.
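Going back to the Sealed Secrets answer for a moment: the basic workflow is simply to encrypt an ordinary Secret manifest with the kubeseal CLI before committing it. A minimal sketch, with the secret name and values purely illustrative:

```
# Create a normal Secret manifest locally (never commit this file)
kubectl create secret generic db-credentials \
  --from-literal=password=s3cr3t \
  --dry-run=client -o yaml > secret.yaml

# Encrypt it against the cluster's sealed-secrets controller
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# sealed-secret.yaml is safe to commit to the GitOps repo;
# the controller decrypts it into a real Secret inside the cluster
kubectl apply -f sealed-secret.yaml
```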
But again, for those stateful options, which one you can go with will definitely depend on your RPO and RTO, because if you are taking snapshots with a tool like Velero, you have to remember that the recovery is a somewhat manual step, which means you have some downtime. Otherwise, you might want to go with other technologies; you might have to look into some proprietary solutions that replicate your data across multiple availability zones.

Where should secrets be stored, and in what form, for GitOps? I think I answered that before: look into Sealed Secrets. For a GitOps approach, Sealed Secrets works perfectly fine, and you will like the workflow; we put secrets in the GitOps repo as sealed secrets. But as I mentioned, if you are really conscious about your secrets, then you need to look into other options where you don't put your secrets in a Git repo at all, but rather in some other tool that takes care of your secrets handling. And if you have, let's say, secrets which are commonly shared by developers, then I suggest you look at an open source tool called gopass. That's a very nice tool; within a team, we use gopass to share the common passwords, because in gopass you need multiple signers to be able to get access and read a secret. But that's not part of the GitOps strategy. It's more for the case of: we use this third-party system called ABC, and we as a developer team would like to know the password; how do we share it between ourselves in a secure way, rather than putting it in a text file or sending it over Slack? gopass is really nice, so look into that.

We are planning to go to production with EKS. Can we have a small summary of the checklist that needs to be passed before this transition, for an HA scenario with stateless services? I think if you have stateless services and you're going with EKS, your best strategy is to go with GitOps on day one, and then you will have your DR built in, actually. And I would do it like this: let's say I'm going with EKS and I have two clusters, one DevTest and one production. I will have a GitOps repo and I will have everything as code. We do this ourselves as well: we delete the DevTest cluster every night and over the weekends, which also saves us cost, and then we have the pipelines that bring the cluster back, and the same GitOps repo is applied again. When you're looking into GitOps, look into the CNCF projects Flux and Argo CD. We have used both of them. Flux is a very lean start, because it does nothing more than kubectl apply. But if you want to go a bit fancier with a nice UI, you can look into Argo CD, and you can deploy both of them as code as well, which makes it very easy. So I bring up a new cluster, I deploy Flux, tell it which repo to watch, and bang, I have my complete DR strategy built in, you know. So for stateless applications, in our experience, if you have only stateless services, GitOps is the best strategy for your DR. Don't look into anything else. It's much faster. Rebuild, not repair. Just kill the cluster. I mean, we have done that: we just kill the clusters and bring new ones back.
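That step of deploying Flux and telling it which repo to watch is roughly a one-liner with the Flux CLI. A sketch, assuming the v2 CLI and an illustrative GitHub org, repo and path:

```
# Install Flux into the cluster and point it at the GitOps repo;
# everything under clusters/prod is then applied and kept in sync
flux bootstrap github \
  --owner=my-org \
  --repository=platform-gitops \
  --branch=main \
  --path=clusters/prod
```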
And rebuilding is sometimes much faster than repairing the cluster, because to repair the cluster you will go and read the documentation, it will not work, you will read it again, it will not work. So just rebuild; it's much faster. Okay, we are using Velero. What was the question? Next question.

So, what do you recommend to start with, Argo or Flux? I mean, if you're starting out, Flux is much better: very lean, simpler. It's not that I like it more; it's that it does not introduce any new concepts, because, as I mentioned, Flux is only doing kubectl apply, so it's much faster to get going. When you go into Argo CD, there is a bit more of a learning curve, because it introduces some new concepts, like applications and those kinds of things, so you have to learn a bit more. But if you're just starting with GitOps, Flux will take you very far. And a lot of the time we use Flux for the infrastructure, by which I mean deploying Prometheus and those tools, and then Argo CD for deploying the developer applications, so we use that combination a lot as well. What about Spinnaker? Oh, we have never used it. I think it's catching up, but we haven't yet had any reason to try it out, I would say. Maybe we will when we have a use case.

And the final question: for all the strategies, how is the load balancer configured? Meaning, are there any additional load balancers required in any of the strategies mentioned in this webinar? Yeah, that's again a good question. The load balancers are usually considered outside of the cluster, because you have your DNS entries pointing to your load balancers, and from the load balancers the traffic gets into your clusters. So when I talked about the strategies, the load balancer is running outside of the picture; it's not part of the picture. And it does not get affected as long as you have the same DNS entries pointing to the same load balancer IP. But of course, if you're bringing up a completely new cluster, then you will have to go and change your DNS or load balancer settings to point to the new IP of your load balancer. So again, it will depend on which strategy you are working with, and it will affect your RPO and RTO in that sense.

Well, thank you very much Rashid, again, for the wonderful presentation and for the Q&A as well. That is all the time we have for today. As I said before, today's webinar will be on the CNCF webinar page at cncf.io slash webinars. Thank you again, everyone, for your time today, and thank you to Rashid as well. Everyone take care, stay safe, and we will see you all next time. Thank you very much, everyone. Nice to have all of you here. Bye bye.