So today we're going to talk about automated multi-region disaster recovery for Postgres on Kubernetes. The agenda: a super high-level look at the problem space and then the solution. I'm going to be the troublemaker talking about the problem space, and Shivani here is going to talk about the solution. I'm Sergey, I'm from Percona, where I work on products. Shivani is a product leader at Elotl, and Jan, here as well, is a software engineer at Elotl. So we're going to talk about a lot of fun things today.

For the problem space, the agenda is: a quick recap of why you would want disaster recovery, then a quick jump into Postgres on Kubernetes, then disaster recovery for Postgres on Kubernetes and the problems it brings, and then Shivani and Jan are going to talk about the solution.

So, disaster recovery. Everyone knows what it is, I hope: it's a plan for how you want to protect your IT systems from data loss, from database failures, from business failures, and so on, and you want to recover quickly. There are usually two drivers for disaster recovery. The first one is business continuity, which basically boils down to SLA requirements: you need to recover your business within five minutes, within 24 hours, or whatever. The second driver is compliance and standards, which is the dull and sad one, because it usually exists only on paper. You might not have real disaster recovery, but you sign it off, your legal team signs it off, everything is great; yet it's usually not real disaster recovery.
In a nutshell, disaster recovery for databases looks like this. You have site A where the database runs, you have the application and the developers, everything is working, everyone is happy, the business is happy. Now something goes south: the database fails or the whole site goes down. The developers are unhappy, the application is not working, the business is losing money. What do you do? You have site B, which is a disaster recovery site, and you recover your data there. Everyone is happy again. So that's disaster recovery in a few seconds.

Now a quick recap on Postgres and operators. I hope everyone here is aware of how databases and operators work with Kubernetes, but in a nutshell, operators for databases give you a simple way to abstract away Kubernetes primitives and database configuration overall. You don't need to be an expert in configuring databases or Kubernetes itself; you just need a database, you describe it in a quite long YAML manifest, and then the magic happens on Kubernetes: your database appears, you can connect to it, you can use it. That's how operators work. And there's more: operators cover not only day one, where you deploy databases, but they also give you a way to manage day-two operations: backups, scaling, whatever you want. So this is the magic of Kubernetes operators. As an example of how the Percona operator looks: you just specify the number of replicas, the configuration, the version of Postgres, and that's it, you have a working database up and running in seconds.
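The "quite long YAML manifest" mentioned here might look roughly like the sketch below. The field names follow the Percona Operator for PostgreSQL v2 CRD, but treat this as an illustrative, trimmed-down example rather than a complete, copy-pasteable manifest:

```yaml
# Illustrative PerconaPGCluster manifest (shape per the Percona
# Operator for PostgreSQL v2; all values here are examples only)
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster1
spec:
  postgresVersion: 16        # which Postgres major version to run
  instances:
    - name: instance1
      replicas: 3            # number of Postgres pods in the cluster
  proxy:
    pgBouncer:
      replicas: 2            # connection pooler in front of Postgres
  backups:
    pgbackrest:
      repos:
        - name: repo1
          s3:                # pgBackRest uploads backups and WAL here
            bucket: my-backup-bucket
            region: us-east-1
```

Applying something like this with `kubectl apply` is essentially all the day-one work: the operator creates the StatefulSets, services, and backup configuration for you.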
If we talk about disaster recovery with the Percona operator, there are a couple of ways to do it. We have two Kubernetes clusters, which can be in different regions, different countries, whatever. The basic, or simplest, way to do disaster recovery is through backups. An operator runs in site A in one Kubernetes cluster, another operator runs in site B in the other Kubernetes cluster, and pgBackRest, which we use in our operator, uploads your backups to some object storage: it can be GCS, S3, whatever. Then, if you have a failure on site A, you can recover to site B by restoring from backups. Of course this is suboptimal, because it can take some time. But if your SLAs allow it, it works: some companies, some customers we have, tell us, "we're okay if our database is down for 24 hours." So that can be one way.

And there is obviously a better way. Again you have two Kubernetes clusters, region A and region B, site A and site B, but instead of relying on backups you can set up streaming replication, so that all your data is synchronized live between the two Kubernetes clusters, between the two Postgres clusters in different regions. And besides streaming replication, you can also replicate through object storage: your backups are uploaded to the same S3, GCS, or Azure bucket, and write-ahead logs are streamed to pgBackRest in region B, so you get near-real-time replication to region B. This is how you can set up disaster recovery, and it works great. But what problems are you going to face?

Automated failover is problem number one. Again, this is our ideal setup: the database is running, developers and applications are happy, and you have disaster recovery in place, replication is there. Now your site or region fails, and you need to switch the traffic, to switch the applications to start using the database in region B. The way you do it right now with our operator is two steps. First, you tell the operator in region B that it is now primary; that's a simple change, just a couple of lines in the manifest: okay, now it's primary, great. But you also need to switch your application traffic to this region, and that's another manual step.
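The "couple of lines in the manifest" are essentially the standby flag. In the Percona Operator for PostgreSQL v2 this lives under `spec.standby`; the sketch below shows the idea, though the exact field set (in particular `repoName`) is an assumption based on the operator's documented standby mode:

```yaml
# Region B cluster, before failover: it replays WAL from the shared
# pgBackRest repo and stays read-only
spec:
  standby:
    enabled: true      # standby mode: follow the primary via the repo
    repoName: repo1    # (assumed field) which pgBackRest repo to replay

# After the disaster, promote region B by flipping the flag:
spec:
  standby:
    enabled: false     # region B is now the writable primary
```

Flipping that one boolean is the promotion step; repointing application traffic at region B is the separate, second step.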
These steps can be automated, and you can write some scripts, but it's not the operator's job to do this; it should be some third-party agent. And that's just the beginning of the problem. The next problem comes when your main region comes back up. Now you need to think about how to sync the data back, because all your new data is in region B, in your disaster recovery site. You need to synchronize the data back, and that is the real problem, because now you have a lot of manual steps to perform.

There are certain myths about disaster recovery and the automation around it. Myth number one: disaster is a rare thing. "Clouds never fail," or "my data center is super solid, I have backup power," whatever. Well, that's wrong. You can look at how Amazon regions fail, Azure regions, GCP, whatever, and on-prem data centers fail more often than you can imagine. And when this happens, depending on your SLAs, you need to switch as fast as possible. That's the problem, because we trust humans a lot, and when someone does something manually there is a lot of room for mistakes. That is why automation for disaster recovery is really, really needed. So, as I said, I introduced the problem, and now Shivani is going to talk about the solution.

Thank you, Sergey. I actually have my mic. So, we met Sergey a few months ago, actually at KubeCon EU, and he mentioned this problem to us. We started collaborating with his team to find and build a solution, because the company I work at is in the business of building multi-cluster control planes, so we automatically have this view and topology of multiple Kubernetes clusters; this sort of problem is right up our alley in some sense. The way I'm going to describe the solution we built: first I will go over what a multi-cluster control plane is and what its core capabilities are that can be leveraged for building something like a DR automation solution.
Then we will go through what a DR orchestration workflow looks like and what additional things we had to build on top of our core control plane capabilities in order to automate the whole workflow Sergey just described. I'll also share a demo, which will show all of this in action. We put the demo together with Elotl Nova, which is the product my team works on, and we used the Percona PostgreSQL operator for the deployment and setup of the two sites. Jan is my teammate; he did all the engineering work, and I will be talking about it. I'm a product manager.

All right, so what is a multi-cluster control plane? It is basically a Kubernetes management cluster which has other workload clusters attached to it. The control plane itself does not run any workloads; those all run on the workload clusters. Its main job, the number one thing such a control plane does, is to deploy workloads to one or more clusters. As a result, because it's in the path of all workload deployments, it has an aggregate view of workload topologies, which in turn enables it to orchestrate things about workloads spread across clusters. This is sort of its secret power. There are a handful of products in this space: there's Karmada, there's Admiralty, and of course Elotl Nova; there are a couple more, Red Hat ACM and KubeFed. I'll be using Elotl Nova in the rest of the discussion because that's obviously what we are most familiar with.

The way this sort of control plane schedules workloads is by decoupling placement from the workload definition itself. You take your application manifest and deploy it onto the control plane as though it were going to run there, but in reality it doesn't run on the control plane. Instead, the control plane has schedule policies which have been defined (you can have default policies too; you don't have to define a policy per workload).
It tries to match the incoming application manifest with a schedule policy and then accordingly spreads the workload onto one or more clusters. This is essentially the basic structure of a schedule policy: it has a couple of ways to select resources and then match them to one or more clusters. The namespace selector and the resource selector narrow down the resources, and the cluster selector selects the target clusters. If you do not specify a cluster selector, that's fine: Nova will make a capacity-based decision to place the workload on any of the workload clusters it's managing. Similarly, if you specify more than one cluster, it will again make a capacity-based decision.

The next thing you can specify in a schedule policy, and this is actually very core to being able to do something like a DR or HA setup, is the spread specification. Essentially it says: take my workload definition and clone it onto all the selected workload clusters. This has two modes. There is a divide mode, which is more like splitting your workload: say you deploy a replica set with 10 replicas, you select two clusters, and you allocate the percentages 50-50; it'll put five replicas on one cluster and five on the other. Then there is duplicate mode, which does exactly what it sounds like: it takes your definition and puts it on all the selected clusters. We're going to use duplicate mode for our DR setup, so that all your configuration, secrets, everything is the same across your sites and never drifts. That's very helpful in maintaining a DR or HA kind of deployment.
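Putting those pieces together, a schedule policy along these lines might look like the sketch below. The `apiVersion`, `kind`, and field names here are assumptions reconstructed from the description in the talk, not necessarily Nova's exact CRD schema; the point is the three selectors plus the spread mode:

```yaml
# Hypothetical SchedulePolicy shape, reconstructed from the talk;
# field names are illustrative assumptions, not Nova's exact schema.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: pgsql-cluster-all
spec:
  namespaceSelector:          # narrow down by namespace
    matchLabels:
      team: databases
  resourceSelectors:          # narrow down the resources to match
    labelSelectors:
      - matchLabels:
          app: pgsql-cluster-all
  clusterSelector:            # pick the target workload clusters;
    matchLabels:              # omit it for a capacity-based decision
      region-group: dr-pair
  spreadConstraints:
    spreadMode: Duplicate     # clone the workload onto every selected
                              # cluster (vs. Divide, which splits
                              # replicas by percentage)
```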
Okay, so that was the basics of the control plane itself. Now let's look at what a DR workflow typically looks like. I might have oversimplified it, I'm sure I have, but essentially these are the stages that come to mind when people think of disaster recovery. Obviously it begins with the setup of the database on multiple Kubernetes clusters (we'll assume we're all deploying on Kubernetes), and these clusters need to be in different cloud regions, or maybe even better in different clouds, or in different data centers if you're on-prem. The challenge, and I've already alluded to this so by now it's probably obvious, is to get the setup right. If you try to do it manually, it's a pretty error-prone exercise: you may not have the same S3 secret for your backup bucket on both your primary and standby, you may not have the same TLS secrets, your configs may not be the same. Using this sort of control plane, where we have the spread scheduling capability with duplication, is the answer to this first part.

Next is obviously data replication. We're not talking about stateless workloads here; we're talking about a stateful workload, a database, so your data has to be on the other side a priori. You cannot start accessing data from your primary site once it has gone down because of the disaster, so you need some form of data replication. Sergey already described the couple of ways you can do this: depending on your RPO (sorry, I should say RPO rather than RTO), you can go with the backup/restore method, and if you need a tighter RPO you can go with the streaming replication method. But this is basically Postgres-native technology.

Which brings us to failure detection. Let's say your disaster has happened: how do you detect it, and how do you trigger the workflow that does the disaster recovery? This is very dependent on business requirements, on what monitoring tools you use, and on how long you want to tolerate the failure before you actually trigger disaster recovery. So we want to be able to do this part in a flexible way. Same thing for failover: every organization has its own runbook, the series of steps they want to follow when a disaster happens.
Maybe it's not all automatic; maybe you want to send an email to someone who has the authority to say "yes, let's switch to the standby," and wait for their response. You can build a whole workflow to do this right, so this needs to be flexible as well. We basically view these as pluggable components in our architecture. I won't talk about failback, because it's very similar to failover; since this is a short talk, we limit it to failover.

So here is what we've come up with as our DR orchestration architecture. We have the Nova control plane, which is the central scheduler, as we've already seen; it takes care of spreading the workload identically on both clusters. Next, we've added a failure webhook, which you integrate with your monitoring tools: you basically register it as an alert receiver. We've also got a failover controller, which runs a job that you provide as a Docker image; this job is the series of steps that encapsulate your runbook, so you can do whatever you want in the script or job you register with the failover controller. So it's basically these one-two-three steps: define your schedule policy, register the Nova webhook with your monitoring tool, and define the job for your actual failover.
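That runbook-as-a-job idea can be sketched as an ordinary Kubernetes Job. Everything below (the image name, cluster and resource names, and the exact kubectl patches) is hypothetical, assembled to mirror the steps described in the talk rather than the actual demo code:

```yaml
# Hypothetical failover Job: the image wraps kubectl plus the runbook
# script; all names and patches are illustrative only.
apiVersion: batch/v1
kind: Job
metadata:
  name: pg-dr-failover
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: failover
          image: registry.example.com/dr/failover-runbook:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # 1. Promote the standby: flip its standby flag to false
              kubectl --context cluster-2 patch perconapgcluster cluster1 \
                --type merge -p '{"spec":{"standby":{"enabled":false}}}'
              # 2. Demote the old primary, in case it ever comes back
              kubectl --context cluster-1 patch perconapgcluster cluster1 \
                --type merge -p '{"spec":{"standby":{"enabled":true}}}' || true
              # 3. Repoint HAProxy at cluster two and reload it
              kubectl --context cluster-3 apply -f /runbook/haproxy-to-cluster2.yaml
```

The point of packaging the runbook as an image is that the failover controller doesn't need to understand its contents; an organization can put whatever approval steps or notifications it wants inside.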
All right, so with that, let's look at a demo. The demo layout looks like this: I have four Kubernetes clusters. One of them is running the Nova control plane, and three workload clusters are registered with it, all in different AWS regions, which is what you would want in reality. Two of them, cluster one and cluster two, are running the Percona database, the first in primary mode and the second in standby mode; they replicate data by means of an S3 bucket. The third cluster is running an HAProxy, which is how the clients communicate with the database. The HAProxy goes to either cluster one or cluster two, and the client is insulated from this fact and can be seamlessly redirected to the correct cluster. The failover job I registered with the Nova control plane does, as described by Sergey, the minimal steps needed: at least switch workload cluster two from standby mode to primary mode, and reconfigure the HAProxy to go to the second cluster.

So now we're going to see the demo. As I said, this is a recorded demo, so I'll try to keep pace with what's going on. (Apologies for the font; can anyone at the back see anything at all? You can? Perfect.) First we get the Kubernetes clusters that are registered as workload clusters. This command goes against the Nova control plane, which is the default kube context. There are three clusters. Now we deploy the Percona operator on clusters one and two. This is the operator manifest; we've added a label, "pgsql-cluster-all", to all our objects, which will be used for matching the schedule policy. The policy that gets matched will put it on clusters one and two, so this is where the duplicate mechanism comes in: it makes sure both of them have the same configuration, same everything.

All right, let's look at the schedule policies. We have four of them registered, and this particular manifest will match the first one, which says "pgsql-cluster-all". Let's deploy the operator. The operator is getting deployed on the first two clusters; we're just checking whether it was deployed on both of them. All the commands you're seeing go against the Nova control plane. Okay, it says two over one; this is how we show that something is duplicated to multiple clusters. We're going to do the same thing with the S3 secret, because we want to make sure it's always the same on both; it will also use the duplicate spread policy.
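A secret like that might look as follows. The key layout mimics the usual pgBackRest S3 configuration style; the secret name, label, and credential values are illustrative assumptions, not the demo's actual manifest:

```yaml
# Illustrative pgBackRest S3 credentials Secret; duplicated by the
# schedule policy so primary and standby always see identical values.
apiVersion: v1
kind: Secret
metadata:
  name: cluster1-pgbackrest-secrets
  labels:
    app: pgsql-cluster-all     # matched by the duplicate spread policy
type: Opaque
stringData:
  s3.conf: |
    [global]
    repo1-s3-key=AKIAEXAMPLEKEY
    repo1-s3-key-secret=exampleSecretAccessKey
```

Because both sites read the same duplicated secret, there is no way for the backup credentials to drift between primary and standby.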
Let's deploy the S3 secret. Okay, the secret is created. Now we create the actual database custom resource; this will be picked up by the operator, which will launch all the database processes and do everything. It uses the backup secret we just deployed. So let's create the first database custom resource: we apply the manifest for cluster one, which will put it on the cluster in AWS region one. Let's see if it's coming up. All right, it's initializing; the operator is doing its thing. Now let's create the second database. The only difference from the first one we created is that this has standby enabled set to true; that's how the operator determines that this one is the standby. Also notice it uses the same secret resource, so there's no scope for a mismatch. Let's deploy the second database and check what's going on with both of them. These two use a different schedule policy, because they are each targeted to only one cluster; they don't use the duplicate spread one.

Okay, so we've got both Percona Postgres databases coming up. Let's now start a client, a psql client. It's a simple client that inserts into a dummy table and, right after inserting, does a count, so on the client's output terminal you'll see a row count being printed and incremented if the client is working properly. What we're going to do is simulate a region failure, a disaster, by removing the primary cluster: we'll kill that Kubernetes cluster. The client will have a minor glitch, which you'll notice shortly. In the meantime, on the right, what you're seeing is the actual recovery job starting to run. The recovery job will switch the standby: it changes the manifest of database two to set the standby flag to false and applies it.
It will also do the same with the primary, in case the primary comes back: you don't want it to keep thinking it's the primary, so just for added safety we switch it to standby. It also reconfigures the HAProxy. So here's the client failing for a brief moment. If we didn't have DR automated, this client would keep failing until someone came and manually did something, but with the whole thing automated, as you see, the client is back in business already. It's kind of like things just self-healed. And that's basically what the demo is about; in the rest of it we're just checking the recovery job to make sure it has completed, and things like that.

All right, let me switch back from the demo. Some takeaways: obviously, to survive widespread outages you need your database deployed in multiple clusters in different regions. Naturally, the use of Kubernetes along with operators makes DR setup easier and opens up opportunities for automation, and automation of recovery can be done in a simple, low-friction way using a multi-cluster control plane such as Nova. As for future work, we want to make everything doable via manifests, so we want to come up with CRD-based definitions for failure detection and failover, and we want to make the control plane itself deployable in highly available mode so that it's not a single point of failure itself. So that's basically it. Here are just some resources for you to learn more if you're interested: the link to the Percona operators is there, and you can do a free trial of Nova; we have the full-featured version available for up to six clusters. Thank you, hopefully this was useful, and feel free to give us feedback.

Okay, great, thanks everyone. We have about a 15-minute break, so if anyone has questions for them, we can take a few questions.
Sorry, two questions, probably. One is about RPO: how do we address RPO in the case of automated failover between two regions? And the other one: I'm curious to understand why you are not using applications in the same local Kubernetes cluster. From what I can see, you're actually using HAProxy, and applications might even be in different regions, so how do you address, for example, latency in that case?

So, these are the two questions. And Sergey, feel free to jump in. For RPO: if you want a tighter RPO, you probably want to use streaming replication, and maybe you can add steps to your failover routine, before you actually switch to the standby, to make sure that you've caught up to a certain checkpoint, or, if you're using backups, that your backup has caught up. But you can only recover data up until the point of the failure; if something was still in cache and didn't get written to disk, I guess that's lost. I don't know if that addresses your question.

Yeah, probably with streaming replication. Probably synchronous replication to the other region, but in that case there's latency.

Yes, so that's why, in any case, automated failover probably expects some data loss. It can happen, yeah, definitely. So sometimes organizations might prefer to delay the failover and do it manually, just to avoid data loss; that can happen as well, if that's an option.

Yeah, it can be an option. Well, if you need manual failover, you probably would not use this one, right? You'd say, "okay, I'm just going to fail over manually, because I need to sync all the data and make sure the data is consistent." But this is useful for other organizations that can tolerate losing a little, like a couple of transactions or something. There are also ways to address or minimize the lag, obviously, and ensure that all the data is written, but that might impact performance, for sure. Go ahead.

Thank you very much for your presentation. Multi-cluster is a thing that I'm stuck on, so I'm interested in this; I'm learning a lot here. Could you, just for my education, tell me: the Nova orchestrator, is that actually a cluster in itself, with just a control plane node, or is it a Docker container? And the Nova agents, are those like stateful sets running on master nodes? Just a sound bite on the architecture of how that works.

Yeah, I'll try, but my team might want to jump in too. Nova itself is a Kubernetes API server with a bunch of controllers backing it.

That's a cluster? It is a cluster?

Okay. But you do not need to dedicate a whole cluster to it; you could share it with other things. And the Nova agents are additional controllers running on the workload clusters.

They're running on the masters of each of the clusters, is that how that works?

They are running on the workload clusters.

I'll ask you later. Thank you very much, it's very good.

If you can give some insight as to what kind of minimum RPO can be achieved with the Nova controller? So the RPO really depends on the underlying Postgres replication. Like I mentioned, Nova itself is not doing the replication; it just helps you set it up by configuring your S3 bucket, making sure both sides have the same TLS or S3 secrets, and things like that. But the RPO, the recovery point objective, itself will depend on your application. Nova can help with RTO, the recovery time objective, but it cannot help with RPO.

Yeah, RPO is mostly on the database. You can achieve zero seconds, but with certain sacrifices, right, and it depends a lot on your workloads, on how you use the data, how you connect, and so on.

Oh, I wanted to ask you, because the script you showed seemed rather manual; I mean, that was a failover, and I guess you need to do some manual steps if you would like to fail back, you know, when you switch from primary to replica and then back, something like that. So I wanted to ask if there is some additional tooling you've developed to automate this process even further, let's say to have the kind of experience we have on RDS or something like that, so we don't need to bother about which one is the primary and it can just switch to the secondary and back without any hassle.

Yeah, so what we showed is just a simple workflow for a simplistic scenario, and it can always be enhanced with more pieces to the workflow. From what I understand, you're saying you want to switch to the standby, but once the primary is back up, you want it to automatically switch back. Yeah, that can always be built; this can always be enhanced to do that. Not yet, we haven't built the failback yet.

So between the two methods of replication you talked about, the streaming replication and the object storage replication, do you have data on how fast they replicate? Let's say the primary is in the east region and the backup is in the west region; how quickly does it replicate with the two methods?

It depends mostly on your networking and the amount of data, right. But again, with streaming replication you can achieve almost zero lag, zero-second lag, so all the transactions would be seen simultaneously, but that would imply some sacrifices, again.