A very good morning, everyone. Thank you so much for choosing to be here; I know you had a number of parallel tracks to choose from. We are excited to share our work today, and I hope you'll learn something from our talk. If you don't, we have a money-back policy: come talk to me at the end of the talk. Okay, so. I'm Selvi Kadirvel. I am the engineering lead at a startup called Elotl. We build Kubernetes management products, and we have two of them: one is called Luna, an intelligent cluster autoscaler; the other is Nova, a multi-cluster scheduler and orchestrator. I've been in the container and Kubernetes management space since 2015, at another startup called ContainerX and then at Cisco. Previously I worked on using machine learning techniques for infrastructure management at VMware and as part of my PhD thesis. I'll let Aman introduce himself. I'm Aman. I work at Yugabyte as a software engineer, primarily on the intersection of Kubernetes, the database management plane, and the database itself. Prior to Yugabyte I worked in the virtualization and infrastructure space at Nutanix and VMware.

Okay, so we'll start by describing the problem we set out to solve by parsing our long title: zero-touch fault tolerance for cloud native geo-distributed databases. We'll talk about two different components, the distributed SQL database, specifically YugabyteDB, and what a multi-cluster orchestrator is, and how they work together to provide us with zero-touch fault tolerance. We'll end with a demo.

Okay, so what is geo-distribution? Simply defined, it is when your database is spread across two or more distinct geographical locations, and it is done in such a way that it is capable of operating without degraded transaction performance. And why do we need it? The top three reasons: typically we want our businesses to run highly responsive services, and one way to do that is to move your user data close to where your end users are. We also want our businesses to comply with data sovereignty regulations. And most importantly, we want to be resilient to the wide variety of failures in the complex software and hardware stacks that we operate in our businesses.

Now, what is a cloud native database? It primarily serves the use case of modern cloud native applications, which have three important requirements: the database needs to be able to scale; we want to be able to deploy it on clouds, on premises, on Kubernetes, in virtual environments, or on bare metal; and once again, we need it to be resilient to failures. It is this requirement of resilience to failures that brings our need for zero-touch fault tolerance.

For the past decade, as soon as we started running our operations on public and private clouds, we've had to deal with a wide variety of failures. Despite that, the dollar cost associated with downtime has only continued to skyrocket. This is an example from a Gartner survey from a few years ago: data center downtime can cost companies between $140,000 and $540,000 per hour. And that is only for verticals that do not include critical services; for banking, manufacturing, or healthcare, your losses could run up to a few million dollars per hour. Typically we'd associate this with lost revenue or financial penalties for missed SLAs, but in addition there are a number of other factors, such as lost productivity (your teams are firefighting rather than actually adding to your business logic), brand reputation loss, customer churn, and employee retention.
If I were an employee and my PagerDuty calls were exploding, I am going to look elsewhere, to more mature, stable environments. Now, who is responsible for providing such zero-touch fault tolerance? Is it your application architect, your application ops, your database architect, your database admin, or your infra teams? We categorize failures into three categories. First, those that can be handled by the inherent resilience within your cloud native DB; this includes storage failures, network partitions, and other software failures. Then there is the category of node and zone failures, where Kubernetes as your orchestrator within a single cluster helps: you have your typical pod controllers that can bring up pods on new nodes, and zone failures can be handled by topology spread deployments, using multiple availability zones and the node groups associated with them. It is when your fault domain becomes regional or cluster-level failures that the combination of a multi-cluster orchestrator along with a resilient database comes in handy. Before we go into how we do this, Aman will talk to us about YugabyteDB.

Yeah, so YugabyteDB is a transactional, distributed SQL database that is designed for resilience, scale, and global data distribution. It is fully Postgres compatible and has support for advanced Postgres features such as triggers, stored procedures, and partial indexes. YugabyteDB can be deployed on VMs and Kubernetes, in the cloud or on premises. It can automatically heal from certain classes of failures and does its own native replication. It's a proven database designed for scale and geo-distribution.

So here is a quick, thousand-foot overview of what YugabyteDB looks like. In this slide we have three main components. One is the YugabyteDB master, referred to as the YB-Master. This is the control plane of the database: it is responsible for bringing up the database (bootstrapping), for shard metadata and placement, and for DDL operations like initializing and modifying the schema of a database. Then we have the YB-TServer, which is the data plane of the database. It is responsible for end-user IO. Internally, the tables of a database are sharded into what are known as tablets, and these tablets are replicated as many times as the replication factor of that particular database installation. Each of these replicated tablets is called a tablet peer, and each TServer, depending on the data placement policy, is responsible for serving IO to a section of these tablets. And then we have the DocDB storage engine, which is an extension of open source RocksDB; we added a Raft-based replication and leader election layer on top of RocksDB, and DocDB is used by both the YB-TServer and the YB-Master as the persistence layer.

Here is also how this looks on a single Kubernetes cluster. On a single Kubernetes cluster, we deploy YugabyteDB using Helm charts. Two main StatefulSets are deployed, one for the YB-Master and the other for the YB-TServer, and appropriate pod disruption budgets are set on the StatefulSets, so that if there's a planned or unplanned outage that takes a node out, or takes out a section of pods, we still respect the replication factor of the underlying database. We also set anti-affinity rules on each of the pods to make sure that they land on different nodes.
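To make that concrete, here is a minimal, hedged sketch of the kind of pod anti-affinity and pod disruption budget just described for the TServer StatefulSet. The label names, image, and storage values are illustrative placeholders, not the exact manifests rendered by the yugabyte Helm chart.

```yaml
# Illustrative only: approximates the PDB and anti-affinity described above.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: yb-tserver-pdb
spec:
  maxUnavailable: 1              # never take out more pods than RF=3 can tolerate
  selector:
    matchLabels:
      app: yb-tserver
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yb-tserver
spec:
  serviceName: yb-tservers       # headless service used by admin and app clients
  replicas: 3
  selector:
    matchLabels:
      app: yb-tserver
  template:
    metadata:
      labels:
        app: yb-tserver
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: yb-tserver
            topologyKey: kubernetes.io/hostname   # force pods onto different nodes
      containers:
      - name: yb-tserver
        image: yugabytedb/yugabyte:latest         # placeholder tag
  volumeClaimTemplates:                           # consistent storage follows the pod identity
  - metadata:
      name: datadir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```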
So if there's a disk failure on a node, or if a node needs to be upgraded, pods can be rescheduled by the StatefulSet controller on a different node. The StatefulSet controller is pretty neat here, because it gives us consistent naming and consistent storage: when a pod moves from one node to another, it takes care of making sure it shows up with the same name identifier, and it also takes care of attaching the right volumes to the right pods. The data is stored on persistent volumes, and in case one of the persistent volumes gets corrupted, YugabyteDB has built-in replication, so it can rebuild data by re-bootstrapping from its peers. And then we also have the headless service that comes as part of the StatefulSet, which is used by admin or app clients to talk to the YB-Master or YB-TServer.

Now, in general, Kubernetes clusters are usually single-region. So how do you take a multi-region database and deploy it on Kubernetes, which is a single-region technology? What we did is we made copies. This is an example setup where we are running across three different regions. We have three different Kubernetes clusters connected by Istio. Istio is a service mesh that allows deploying workloads across multiple Kubernetes clusters, and it allows us to configure the data plane such that pods running on one cluster can talk to pods running on another cluster. In this setup we run Istio in multi-primary, multi-network mode, which means that we have three copies of the Istio ingress gateway and istiod running in the setup, one on each cluster. Using DNS proxying, we also expose the services that are running on each of the clusters to the other two clusters. This gives the pods full connectivity: any pod running on any of these clusters can talk to pods running on the other two clusters.

At this point we deploy the YugabyteDB Helm chart that we talked about, but we deploy three copies of it, so that one ends up on each of the clusters. In the YB-Master configuration we set appropriate placement policies so that the database places one replica of the data in each region, and we set the replication factor to three. What this gives us is that if one of the regions in the setup were to go down, database availability is still maintained and application availability is unaffected: the application can basically fail over to a different region. YugabyteDB also has something called smart clients; if you're using smart clients, they're intelligent about this and can transparently load-balance or move the traffic to another region automatically.

So that gives us fault tolerance in terms of a region failure. But let's say a region does fail. This is not a great state to be in, because even though your application availability is maintained and your applications continue to function, another failure in this setup will cause data unavailability and will cause application availability to be affected. So we want to recover from this state. In the previous single-cluster setup, the StatefulSet sort of gave us that ability: it will bring up the pod if a node fails. But how do we do that in this kind of multi-cluster setup?
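As a rough reference for the multi-primary, multi-network Istio setup just described, here is a hedged sketch of the per-cluster install values. The mesh, cluster, and network names are placeholders chosen for illustration; the actual demo's values files may differ.

```yaml
# Illustrative per-cluster Istio install values for multi-primary, multi-network mode.
# One such config is applied per cluster (the "west" cluster shown here);
# an east-west gateway and cross-cluster remote secrets are also required.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1                # same mesh ID on all three clusters
      multiCluster:
        clusterName: west          # unique per cluster: east, west, central
      network: west-network        # traffic between networks flows via the east-west gateways
```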
And Istio is just representative here; this can be done with any multi-cluster services solution, such as GKE MCS or the equivalent on EKS. So how do we recover from these outages? To recover, first we need to detect, as a database, what kind of an outage it is. Is it a permanent outage? That is a hard problem, because you might not have the service accounts, permissions, or roles to be able to detect it. The next thing, once we've detected and confirmed an outage, is to provision a new Kubernetes cluster or use a standby one. Then we have to reconfigure our service mesh to add this new cluster in, deploy the YugabyteDB Helm chart again, reconfigure the YugabyteDB cluster to add the new replicas and remove the old failed replicas, then wait for data migration and hope that there are no more failures while this is in progress. That doesn't sound like fun times to me, especially when you're dealing with an outage, in a stressful environment like this. We can automate these things and build a runbook, but every Kubernetes cluster, on-prem or even in the cloud, is slightly different, so consistently expressing and executing these runbooks is not trivial. And that's where a multi-cluster orchestrator steps in and helps us.

Thank you, Aman, for the deep dive into YugabyteDB. We'll now learn about what a multi-cluster orchestrator is. To put it very simply, it is a control plane that enables deployment of your Kubernetes workloads across a fleet of clusters. There are a number of orchestrators now available in the ecosystem: you have Karmada from Huawei, Open Cluster Management and ACM from Red Hat, Rancher Fleet, KubeStellar on kcp, which is being contributed to by IBM Research and Red Hat, and Elotl's product Nova.

So what goes into your multi-cluster orchestrator? It is your typical set of workload manifests, and in addition to that a schedule policy, which is the core essence of how these orchestrators work. Let's look at what that consists of. The schedule policy does the mapping between your Kubernetes resources and the specific cluster that you want to run them on. Here's a simplified schedule policy. It has a resource selector, which says: here's the subset of resources that I want to match with this policy. And it has a cluster selector, which chooses the specific clusters you're interested in deploying this workload to. There are a number of schedule policy types. An annotation-based policy is one in which you just take your workload and add an annotation to it with a cluster ID. A capacity-based scheduling policy is much more interesting: say, as a developer, I have a CI workload that I want to run on any one of my dev clusters that has sufficient resources; you'd use a policy like this. The orchestrator looks at your pod resource requests, looks at your cluster availability, and does the mapping. We also have the concept of include and exclude lists. This would be useful in the case where, say, I'm ready to move my workloads from staging to production, and we typically push all our workloads to all US regions except West, which is, say, our most loaded region, which we push 24 hours later; you could use an exclude list for that. We could also use an exclude list for certain clusters that are being upgraded or are in maintenance mode. So these come in handy.
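As a rough illustration of what such a policy might look like, here is a hedged sketch of a capacity-style schedule policy with an exclude list. The API group, version, and field names are approximations for illustration only; check the orchestrator's actual CRD schema before using anything like this.

```yaml
# Illustrative only: field names approximate a multi-cluster schedule policy.
apiVersion: policy.elotl.co/v1alpha1     # assumed group/version
kind: SchedulePolicy
metadata:
  name: ci-workloads
spec:
  resourceSelectors:                     # which incoming resources this policy matches
    labelSelectors:
    - matchLabels:
        app: ci-runner
  clusterSelector:                       # candidate clusters for placement
    matchLabels:
      env: dev
  # capacity-based placement: pick any matching cluster whose free capacity
  # covers the pods' resource requests
  clusterExcludeList:                    # e.g. clusters in maintenance, or the last-wave region
  - us-west
```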
Now, there are certain advanced policies that help specifically with YugabyteDB and Istio workloads. An example of this is a spread policy. This simply says: take my incoming workload and duplicate it on multiple clusters, and add overrides. Overrides are the key feature that allows you to modify certain pieces of your manifest with custom values on each cluster. Here's a snippet picked up from an actual policy that we're going to show in our demo. As you can see, it specifies some spread constraints and uses the duplicate mode. The divide mode is an alternate mode, which we're not using here; we've had prospective customers who want to split a single deployment across multiple clusters through percentage specifications, and that's what divide allows. As you can see, this particular ingress gateway needed to be overridden with two cluster-specific values. Here, our cluster was conveniently named West, so we had to change those values to a West network and a West cluster for it to work.

Okay, so now the orchestrator has helped me set up YugabyteDB and Istio on my fleet of clusters. Why is it involved in fault tolerance? The key property of the orchestrator that allows this is that it has both visibility into your fleet as well as control; it is in the critical path. And there are two aspects that it controls: both your workloads and your clusters. This is what makes it different from, say, a typical cluster lifecycle management tool that you'd be using for the CRUD of your clusters. And this is what allows it to coordinate the set of complex recovery steps needed to bring your database back from a degraded mode of operation into a highly available mode after a failure.

So, summarizing our problem: a cloud native DB like YugabyteDB provides the scale, resilience, and performance you need for your geo-distributed applications. Its inherent resilience is essential, but it handles within-cluster failures such as node failures, disk failures, and network partitions. When your fault domain becomes the region or the cluster, the orchestrator comes in and provides you with zero-touch fault tolerance.

This is a graphic of the demo we'll look at right now. On top, in the green block, is the orchestrator. It has a scheduler which handles the schedule policy custom resource. It has a recovery webhook which can receive alerts from your monitoring stack, which could be any on-prem monitoring solution; in our case, we'll use Google Cloud Managed Prometheus. At the bottom is your fleet of clusters, named East, West, and Central, as they are deployed in those regions. The blue boxes represent your Istio and YugabyteDB workloads. The green agent is what allows the orchestrator to control your fleet. On the left, the dashed box shows a standby cluster which will pick up workloads and do the reconfiguration needed in case of failure. So the recovery webhook is listening in; as soon as it receives an alert, it sets up your workload, reconfigures it, and makes sure it is able to communicate with the West and Central clusters that are still up and functioning. Okay. What does recovery involve? It requires two pieces. One is a recovery policy, which simply says: take my recovery job and run it on the East Prime standby cluster.
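Before getting into the recovery steps, here is a hedged reconstruction of the spread-with-overrides snippet mentioned above. The policy schema, patch layout, and label paths are all illustrative assumptions; only the intent (duplicate the workload everywhere, then patch the east-west gateway with a West-specific network and cluster name) comes from the talk.

```yaml
# Illustrative only: approximates a spread policy with per-cluster overrides.
apiVersion: policy.elotl.co/v1alpha1     # assumed group/version
kind: SchedulePolicy
metadata:
  name: istio-yb-spread
spec:
  spreadConstraints:
    spreadMode: Duplicate                # copy the workload onto every matching cluster
    # Divide is the alternate mode: split a deployment across clusters by percentages
    overrides:                           # per-cluster patches applied to the manifests
    - clusterName: west
      patches:
      - resource:
          kind: Deployment
          name: istio-eastwestgateway
        jsonPatch:                       # RFC 6902 patch; paths shown are illustrative
        - op: replace
          path: /spec/template/metadata/labels/topology.istio.io~1network
          value: west-network
        - op: replace
          path: /spec/template/spec/containers/0/env/0/value   # e.g. the cluster ID env var
          value: west
```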
The recovery job itself consists of a sequence of steps, which include the Istio prerequisites, deploying Istio, validating it, creating secrets for cross-cluster communication, followed by deploying YugabyteDB, followed by using the Yugabyte administration tool to reconfigure your YugabyteDB universe. We list them in detail because this is what your infra team would be doing manually during an outage, and we want to take that toil away from them.

Okay. Since we're going to be going through a number of terminals, we'll include slides that give an overview of what you're going to see. The first step: we'll have five Kubernetes clusters on GKE. We'll deploy the orchestrator as well as the workload clusters, we'll deploy Istio and YugabyteDB, and then see what happens. We'll go through it a little faster; this is all recorded ahead of time, the usual KubeCon fix, in my local environment. This is GKE, in which I have five clusters. I first install the Nova control plane, which takes a few minutes, and then we install the agents on the fleet. Once we do this, we do not need to talk to any one of our fleet clusters; we just keep talking to the control plane. We then do a kubectl get clusters. Cluster is not an inherent resource in Kubernetes; it is something your orchestrator makes available. We rename those contexts to Central, East, East Prime, and West, all in different regions. We see that all of them are ready and willing to accept workloads.

We start with deploying Istio. The service mesh needs to be deployed first, before we can get to our DB workload. This is a script that basically does a sequence of kubectl apply commands for both the policies and the Istio workload. Once we deploy it, we check the Istio namespace. We ensure that the ingress gateway, the east-west gateway, and the istiod pods are available, and we make sure that an external IP has been provided by the cloud provider. We then install the remote secrets generated by the istioctl command, and double-check that these secrets are available. West will have secrets titled Central and East, so it can talk to its YB node partners. Take a quick peek at the time. Okay, we do have time.

Okay, we then start deploying YugabyteDB, which is once again a Helm chart that can be targeted at just the top-level cluster orchestrator's API server; it does not need to talk to your fleet. Once YugabyteDB is deployed, once again we check all its services. We make sure the master pod's UI is available; we'll then go to the browser and ensure that it is up. We check the YB-Master and YB-TServer pods that Aman described to us. Let's pause here a bit. On the right-hand side is what is most interesting: you see three YB nodes. In the second column, you'll see their Raft roles; there are two followers and a leader, and you see that they've all come up. Their uptime is about two minutes.

Okay. Next, we'll set up the recovery pieces. As we said, it's a policy and a job, and we'll be applying them both. Currently the schedule policy we use is highlighted as not to be deployed yet, which means the job is in a pending state; it's waiting for an alert, and it'll be deployed as soon as the alert is received. We apply the recovery job and check it: it's registered but not running, at 0/1. We then start an end-user workload. This is what is going to be running continuous SQL operations. We keep a lookout on the read and write ops; it's about 140 here, and we'll see that it does not change. We then go into our Google Cloud console and set up a Google Cloud alert.
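Before looking at the alert configuration, here is a hedged sketch of what a recovery job like the one applied above could look like. The pinning annotation, container image, helper script names, addresses, and placement arguments are all illustrative placeholders, not the exact job from the demo; only the yb-admin modify_placement_info subcommand itself is a real Yugabyte admin command.

```yaml
# Illustrative recovery Job: runs the manual runbook (Istio prerequisites and install,
# cross-cluster secrets, YugabyteDB Helm install, yb-admin reconfiguration) as one batch
# job pinned to the standby cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: yb-region-recovery
  annotations:
    nova.elotl.co/cluster: east-prime    # hypothetical annotation pinning the job to the standby
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: recover
        image: example.com/recovery-tools:latest   # assumed image bundling istioctl, helm, kubectl, yb-admin
        env:
        - name: MASTER_ADDRS
          value: "yb-master-west:7100,yb-master-central:7100,yb-master-eastprime:7100"  # placeholders
        command: ["/bin/sh", "-c"]
        args:
        - |
          set -e
          ./install-istio-prereqs.sh east-prime            # namespaces, CA certs (placeholder script)
          istioctl install -y -f istio-east-prime.yaml      # multi-primary values for the new cluster
          ./create-remote-secrets.sh east-prime west central
          helm install yb-eastprime yugabytedb/yugabyte -f overrides-east-prime.yaml
          ./wait-for-masters.sh
          # re-point the universe at the surviving regions plus the standby, dropping the failed replicas
          yb-admin --master_addresses "$MASTER_ADDRS" \
            modify_placement_info gcp.us-west1.a,gcp.us-central1.a,gcp.us-east5.a 3
```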
For simplicity, we're using the master pod ready status. This is not what you'd be doing in production; you would have a complex alert with a number of different application- and system-level metrics. So we use metric absence as the trigger condition. We then set up the notifications. This is the key part that closes the loop: in addition to sending your admins an email, it will talk to the orchestrator's webhook endpoint. We then inject a failure: we'll edit our StatefulSets to set the replicas to zero. Thank you to some of you who are nodding your heads; this is making sense.

Okay. So the final step is the recovery. We see that a few incidents get created from the alert. For these incidents, you don't have to do anything: the orchestrator edits its policy and chooses a standby cluster to deploy the workloads on. You'll notice here, as you can see, that East Prime is the standby cluster that's been chosen. And we see that the node has come back up: the two YugabyteDB StatefulSets at the bottom of the screen are up and running, and they've been running for about two minutes. We'll go to the UI and refresh it. We see that the bottom-most entry is the new YB node; you'll notice that its uptime is about 59 seconds, while the other two have been running for about an hour and 20 minutes. To make sure everything did go okay, we once again go into our end-user workloads. Oh, and there's another check we do first: we go into the Yugabyte metrics that are made available. Aman, would you like to talk about those?

Yeah. So each of the YB-Masters exposes a metric called follower lag, which basically comes from the underlying DocDB, and it shows how far a follower is behind the leader. In this case it was, I think, 120 milliseconds behind. So that's a good indicator that the cluster has healed and is ready to accept the workload. And we can see here that all this while, while the recovery was going on, no data availability was lost; reads and writes kept continuing, and the smart client was redirecting traffic transparently to the other replicas that were alive. Okay. So yeah, we do notice that the ops performance for reads and writes continues to remain the same, without any issues.

Okay. So what are our takeaways? Cloud native apps are well served by databases that are resilient, like YugabyteDB, which have self-healing capabilities to rebuild state in the presence of failures. Multi-cluster orchestrators complement this by automating infra-level recovery tasks in the presence of region- and cluster-level failures. So what is the benefit for your business? You're assured that business continuity goes on as usual. You minimize revenue loss by reducing the toil involved in setting up your clusters and your workloads and reconfiguring them, all during an expensive outage. Another important effect of this is that you're able to run periodic fire drills and chaos tests. It is said that a DR policy that is not actually periodically tested is equivalent to having no DR policy. So this will warm up your teams to do these tests much more regularly as part of your product testing.

Okay. We have some exciting ways in which we're extending this work. We want to avoid the use of standby clusters and instead use just-in-time clusters; the orchestrator is capable of cloning existing clusters on demand. Secondly, some of our prospects have told us that they want a human in the loop. They don't want zero touch.
They would like some friction and an audit trail in their fault tolerance triggers. Thirdly, you noticed that all of our recovery was captured in a Kubernetes Job; we would like to make that a CRD, to make it more flexible, more declarative, and generic across different DB use cases. Finally, we're also extending our spread and capacity schedule policies to be cost-aware and latency-aware, which will help further in these DB environments. And, most importantly, thank you to our teams. We have Machik here; an experiment that takes me two hours took him about 200 hours to get all this magic in place. So thank you, Machik, and the rest of the team at Elotl, and Michael from team Yugabyte. So if you want to be involved, please try out our products and come talk to us, and if you have other day-two operations for orchestrators, we would love to talk to you, and we're happy to take questions. If you want to learn more about YugabyteDB, we have a booth, C11, and also a concurrent event, DSS Day, going on right now in the Level 4 Horizon Ballroom, so there are a lot of detailed talks on YugabyteDB there. Okay, happy to take any questions. Anybody have questions?

Hey, awesome talk. So, a two-part question. How long does it take for a failed database to come back up in a different cloud? That's part one. Part two: could you talk to the complexity of the recovery point objective from a database point of view? Databases are pretty complex; the short version of what you had to do in the database to make sure that it came back up with the right snapshot in place in the target infrastructure. Thank you.

Do you want me to take that? Sure. So, from the database perspective, our RPO is zero and RTO is three seconds. This is because, as soon as a failure is detected internally in the database, it transfers leaders from that failed zone onto another one, and basically availability keeps functioning. As far as what is needed to get the new replica up: in a single-cluster scenario the StatefulSet brings it up and moves the volume over. Even in the case where the volume got corrupted or went missing for some reason, TServers automatically recover and rebuild their data from peers; for the master, because it's a control plane node and is also responsible for bootstrapping the others, we have to run actually just one command to get the master back into the right state. As far as time goes, it also depends on how much data it needs: whether there is existing data from an existing snapshot, or, as in this case, it is rebuilding from a completely fresh snapshot. So if it's copying a bunch of data, then it takes a little longer to be fully functional. And for replication we can drive line rate, as in whatever the line rate is between the regions. In this setup, I think it took about 50 seconds to get the master up and running and everything healed up. I think it was 50 seconds or maybe two minutes. Two minutes. Okay. Yeah. Thank you.

The gentleman here asked about RPO; that's primarily for high availability, I suppose. What about disaster recovery? What about backups?
If you need, for example, to do point-in-time recovery, how do you address that? So YugabyteDB has a fully mature backup and point-in-time recovery system. We expose it via our management plane, and we are also building a Kubernetes operator that talks to the management plane to do backups and point-in-time recovery. Does that answer your question? All right. So there is another product that we have, called YugabyteDB Anywhere, which is our management plane, and the backup and point-in-time recovery constructs are exposed via that management plane. So, depending on how backups are configured, we support a few options. There is something called xCluster, where you can have a remote YugabyteDB cluster basically act as a backup cluster and constantly transfer the workload between them; this supports incremental backups as well. You can do offsite backups to S3, which also supports incremental and full backups. We have scheduled backups as well, so you can do them on a schedule. And because you also need to do this for multiple databases, not just one, that's why it's exposed via our management plane; however, we are also bringing it to the Kubernetes cluster by building an operator around it. Yeah. Okay. Thank you.

And to add to that answer: the multi-cluster orchestrator can also help with DR. We actually have a talk later this afternoon where we work with Percona and Postgres to initiate and trigger this automated DR process too. So, any other questions? Yeah.

Thanks, Selvi. I had a question about the recovery job itself. How much context would be passed to that recovery job from the point of failure that's detected? Meaning, is that recovery job a statically defined thing, or can there be input into it so it knows how to recover from different types of failures? Yes, it had to be handcrafted by someone who understood the Yugabyte architecture, but yes, it can totally be customized. It is a Kubernetes Job; it's going to be checked into your Git repo, and it can be customized to your particular use case. Perfect, thanks. Anybody else have questions? Yeah.