Welcome, everyone, to this breakout session. We are live in Detroit, and welcome also to our virtual audience. Our talk today is "One VTOrc to Rule Them All": high availability in a distributed database system. The distributed database we'll be talking about is Vitess. My name is Deepthi Sigireddi. I'm a maintainer and technical lead for Vitess, which is a graduated CNCF project, and I'm a software engineer at PlanetScale.

Hello, I am Manan. I'm also a maintainer of Vitess and a software engineer at PlanetScale.

We'll start with a Vitess overview. Vitess is a cloud native database, but it originally started as a scaling solution for MySQL at YouTube. From those beginnings in 2010, it has evolved into a general-purpose distributed database that is suitable for many workloads and many use cases. Vitess is massively scalable. To give some numbers: the largest instances of Vitess run to 20,000 terabytes, which is 20 petabytes, with 22,000 replicas and 250,000 simultaneous connections, and those are not limits; you could go higher. One of the ways Vitess achieves this massive scale is through sharding, which we won't cover in this talk, but we have a maintainer talk tomorrow where you can learn more about how Vitess does sharding. Vitess is also highly available, and that is going to be the focus of this talk. And Vitess is MySQL compatible, with both the 5.7 and 8.0 versions.

Now, a brief overview of the architecture of Vitess. Clients talk to Vitess through a proxy called VTGate. VTGate receives client requests, which can come in over either the MySQL protocol or gRPC (those are the two protocols it supports), and routes them appropriately to the backend MySQL instances. In a sharded Vitess system you may have multiple shards, and within each shard there is a primary MySQL and one or more replicas. VTGate figures out which shard to route each request to. And when we run in this primary-replica mode, we have to make sure that each shard is highly available.
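Since VTGate speaks the MySQL wire protocol, an ordinary MySQL client library works unchanged against it. Here is a minimal Go sketch of that client path; the address, the port (15306 is common in Vitess's local examples), and the commerce keyspace and customer table are assumptions for illustration, not details from the talk:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // VTGate speaks the MySQL wire protocol
)

func main() {
	// Hypothetical VTGate address and keyspace; adjust to your deployment.
	db, err := sql.Open("mysql", "user@tcp(127.0.0.1:15306)/commerce")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// VTGate parses the query, determines which shard(s) it touches,
	// and routes it to the appropriate primary or replica.
	var count int
	if err := db.QueryRow("SELECT COUNT(*) FROM customer").Scan(&count); err != nil {
		log.Fatal(err)
	}
	fmt.Println("customers:", count)
}
```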
In addition to this query serving path, through which VTGate receives queries and routes them to the backend MySQLs, there is also a control plane, and this is where VTOrc comes in. There is vtctld, the Vitess control daemon, which takes management commands from users through the CLI or the web UI and executes them. There is VTAdmin, which powers our web UI. And then there is VTOrc.

So let's go through a short overview of VTOrc, and then we will have a demo. What is the problem that VTOrc is trying to solve? The main problem is resiliency to MySQL failures. We are running MySQL in a replicated mode, but even then, if the primary MySQL goes down, we have to deal with that failure mode. We need to be highly available, and that is the problem VTOrc solves. However, while ensuring high availability, you also want to make sure that the data is durable: if a transaction has been accepted by the system, it shouldn't be lost; that data should be persisted somewhere. And during this process of detecting and recovering from failures, we want to minimize the time the recovery takes.

So what was the state of the art before we built VTOrc? There was Orchestrator, a separate open source project that a lot of large MySQL deployments used to deal with failures at the MySQL level, and there was an integration between it and Vitess. The integration worked well enough most of the time, but not all of the time. In addition to that, the way Vitess provided durability was through semi-synchronous replication, which we will go through in a minute, and that configuration didn't really work well all of the time either. We discovered failure modes where the semi-sync setting would be lost: whatever the user intended would not actually get set on the MySQL. So over time we discovered all of these problems with the Orchestrator integration and with how semi-sync was implemented in Vitess, and we decided this needed to be rebuilt from the ground up.

VTOrc is now GA. We announced Vitess 15 general availability yesterday, and VTOrc went GA with it. But VTOrc is already in production: there are at least two deployments that we know of where VTOrc is being used to handle unplanned failovers and deal with primary failures. I'll give you a minute to read this quote from a Vitess user who is deploying VTOrc with Vitess in Kubernetes using the Vitess Kubernetes operator.

Let's get a little deeper into what VTOrc is doing. We've talked about what functionality it provides, but how does it do that? VTOrc is the agent that monitors the Vitess cluster and detects failures. It has logic to check for certain conditions at periodic intervals and flag failures, and once it detects a failure, it goes through a repair or recovery mechanism.
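The talk doesn't show VTOrc's internals, but the detect-then-repair cycle just described can be sketched in Go. Everything below, the names, the five-second interval, the problem types, is hypothetical and only illustrates the shape of the loop:

```go
package main

import (
	"fmt"
	"time"
)

// Problem is a condition that detection might flag on a shard.
type Problem int

const (
	NoProblem Problem = iota
	DeadPrimary
	ReplicationStopped
)

// probe checks one shard and reports the first problem found. A real
// monitor would poll tablet state and MySQL replication status here.
func probe(shard string) Problem {
	return NoProblem
}

// repair runs the recovery appropriate to the detected problem, e.g.
// an emergency reparent when the primary is dead.
func repair(shard string, p Problem) {
	fmt.Printf("shard %s: recovering from problem %d\n", shard, p)
}

func main() {
	// Check all shards at a fixed interval; recover when needed.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, shard := range []string{"commerce/0"} {
			if p := probe(shard); p != NoProblem {
				repair(shard, p)
			}
		}
	}
}
```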
The way VTOrc achieves durability, or in general the way Vitess achieves data durability, is through replication. One primary and more than one replica gives us the guarantee that any transaction accepted by the primary is replicated to at least one replica, so that in case of a disk failure or a node failure you don't lose data that a client believes has been persisted. And the way we achieve high availability is through failover. When a failure is detected, we perform what we call an unplanned leader election. But in the normal course of things you have to upgrade software, you have to upgrade hardware, whatever it is; you want to be able to do planned failovers so that the system remains available through maintenance operations. That is Vitess's general strategy for high availability, and VTOrc is the agent that enables those things.

The design principles on which VTOrc is built came from consensus systems. Consensus in a distributed system is a problem that people in academia and industry have been working on for many years now, forty-plus years, and there are well-known algorithms that solve the consensus problem. In Vitess, a group of us studied some of these algorithms and then came up with an engineering approach. This approach prioritizes performance over theoretical correctness. That's not to say the system is not going to be correct, just that we did not prioritize writing a formal specification and validating it. We've taken an engineering approach to solving the consensus problem. And we are not interested in purity, in the sense that we don't have to implement the whole consensus algorithm ourselves. Vitess has a topology server, which can be etcd, ZooKeeper, or Consul, where we store some state and which is used for discovery, so we lean on the topology server to provide the persistent state that can help us in recovery. Vitess is a single-leader system, and one of the principles we try to maintain is that we fulfill requests while respecting the durability policy that has been chosen by the user.

Durability versus availability is a trade-off, and in Vitess it is a configurable trade-off, so that people can decide which is more important and under which situations. We have also separated the planned leader election from the unplanned leader election. There is the normal operation of the system, which is very high QPS, we expect Vitess systems to serve millions of queries per second of reads or writes, whereas leader elections are much rarer. Planned leader elections happen maybe once a day or once a week in a Kubernetes environment, and maybe even more rarely in a non-Kubernetes environment; unplanned leader elections are expected to be relatively rare as well. So we've separated these things, to say that the way you achieve consensus for normal operations is very different from the way you do it for the leader election process. The system should be able to make forward progress, and we need to be able to deal with race conditions. Given that VTOrc is an agent, it is itself susceptible to failures, which means you may be running multiple VTOrc instances, possibly in multiple regions, and we have to deal with failure modes of VTOrc itself. Whatever we came up with has been written up in a blog series by one of the founders of Vitess, Sugu Sougoumarane, and we have a link to it towards the end.

What we came up with, in terms of leader election, is that leader election has three distinct stages: revocation, which is revoking the previous leader; election, choosing a new leader; and propagation, where any requests that were accepted by the previous leader and acknowledged to a client need to be propagated by the new leader to all of the followers, or replicas.

In a planned leader election, revocation is actually much simpler than in an unplanned one, and this is one of the reasons we've separated those two. In a planned leader election, because the current leader is still available, you just ask it to step down before selecting a new leader. It's an orderly transition. A new leader is chosen; it's possible that there are still some completed requests that have not been propagated to all of the replicas, and the system will make sure they get propagated. That's what a planned leader election looks like. As an example, the leader changed, and the yellow arrows, which represent replication, have all moved over to the new leader.

An unplanned leader election is a little bit different, because in order to revoke the leader in an unplanned election, we have to be able to reach a sufficient number of followers. I'm calling that number M, and what M is we will find out in a little bit. But once revocation is complete, things look the same as for a planned leader election: a new leader is chosen, and how the new leader is chosen also depends on the durability policy, and completed requests have to be propagated. So let's say that leader became unavailable and a new leader is chosen: all of the replicas now replicate from the new leader, and the previous leader is out of the topology.
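Those three stages can be drawn schematically in Go. This is a simplified sketch, not the actual Vitess implementation: a single integer stands in for a GTID position, and durability-policy checks are omitted:

```go
package main

import "fmt"

// Tablet is a hypothetical type for illustration; not the Vitess API.
type Tablet struct {
	Alias   string
	GTIDPos int // simplified stand-in for a replication (GTID) position
}

// revoke ensures the old leader can no longer accept writes. Planned:
// ask it to step down. Unplanned: reach enough replicas (M of them) to
// starve it of the semi-sync ACKs it needs.
func revoke(old *Tablet) { fmt.Println("revoked:", old.Alias) }

// elect picks the most advanced eligible replica.
func elect(replicas []*Tablet) *Tablet {
	best := replicas[0]
	for _, r := range replicas[1:] {
		if r.GTIDPos > best.GTIDPos {
			best = r
		}
	}
	return best
}

// propagate catches every follower up to the new leader and points
// replication at it, so acknowledged requests reach all replicas.
func propagate(leader *Tablet, followers []*Tablet) {
	for _, f := range followers {
		f.GTIDPos = leader.GTIDPos
		fmt.Println(f.Alias, "now replicating from", leader.Alias)
	}
}

func main() {
	old := &Tablet{"zone1-100", 40}
	replicas := []*Tablet{{"zone1-101", 42}, {"zone1-102", 41}}

	revoke(old)               // stage 1: revocation
	leader := elect(replicas) // stage 2: election
	var followers []*Tablet
	for _, r := range replicas {
		if r != leader {
			followers = append(followers, r)
		}
	}
	propagate(leader, followers) // stage 3: propagation
}
```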
I've said "durability policy" quite a few times, so I want to talk about what exactly a durability policy is in the Vitess context. The definition of durability we are working with is that if a client request has been acknowledged, any changes that client made to the data are persistent and are not lost. So we are talking about not losing data; data loss is what we want to avoid.

MySQL has an asynchronous replication mode and a semi-synchronous replication mode, and with the right parameters you can configure semi-synchronous replication in such a way that any write to the primary will block until an acknowledgment is received from at least one replica. The acknowledgment from the replica doesn't mean that the transaction has been completely applied on the replica; it only means the replica has received it and persisted it, essentially to disk. What the durability policy defines is basically how this is configured for Vitess. Semi-sync is the foundation, but with the durability policy you can configure which replicas are eligible to be promoted to primary (for whatever reason, some replicas may not be eligible), how many semi-sync acknowledgments are required by each primary (most people use one, but there may be situations where people want two), which replicas are eligible to send those acknowledgments, and so on. What durability policies give operators of Vitess is increased flexibility in their configurations and their topology, in how they lay out their replicas for worldwide replication, or across data centers, regions, and availability zones, those kinds of things. And now I'll hand it off to Manan.

Thank you, Deepthi, for that introduction. Now that we've looked at what durability policies are and how Vitess tries to ensure durability, we can look at the revocation phase in a little more detail, and at exactly what the M is that Deepthi alluded to a few slides ago. How do we know that we have been able to reach sufficient tablets to guarantee safety? To answer that question, we first have to look at a concept called intersecting quorums. We just saw with semi-sync that in order to accept a transaction, a primary requires some number of semi-sync ACKs from the replicas. So a primary, along with the set of replicas sending it those semi-sync ACKs, together forms a set of tablets that can accept transactions. This set of tablets capable of accepting writes is one quorum for accepting writes, and there can be many such possible quorums. In order to guarantee safety during the revocation phase, we should be able to reach at least one tablet from each of the possible quorums. The intent is that if there are unreachable tablets that can form a quorum, they could accept writes, which we don't want them to do. The reason is that an unreachable tablet and a tablet that is down are virtually indistinguishable from VTOrc's perspective.

Let's take an example to make this even clearer. Say we have six tablets, and only three tablets are required to accept a transaction: a primary plus two semi-sync ACKs. Suppose the network was partitioned so that you had two VTOrc instances running on the two different sides of the partition, with three tablets on either side. In this situation, if you do not perform the revocation-phase check at all, each of those VTOrc instances can promote a primary, and each primary will be able to accept writes, because it only requires two semi-sync ACKs and there are two other tablets on its side. So you would have one primary elected here and one primary elected there, and that would not be a correct situation, because you could very well be in a split-brain scenario: two primaries accepting writes can diverge, because they could have accepted different writes, and they aren't replicating to each other because they're network partitioned. So in this case it would not be safe for either of the VTOrc instances to proceed: from each instance's perspective, there are three tablets it can't reach, and those three tablets can form a quorum.
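That safety rule, that the reachable tablets must intersect every possible write quorum, can be captured in a short, hypothetical Go check. Here it is applied to the six-tablet example, where a write quorum is any primary plus two ACKers, i.e. any three tablets:

```go
package main

import "fmt"

// revocationIsSafe reports whether the reachable tablets intersect every
// possible write quorum. If they do, no fully unreachable set of tablets
// can assemble a primary plus enough semi-sync ACKs to accept writes.
func revocationIsSafe(reachable map[string]bool, writeQuorums [][]string) bool {
	for _, quorum := range writeQuorums {
		hit := false
		for _, tablet := range quorum {
			if reachable[tablet] {
				hit = true
				break
			}
		}
		if !hit {
			return false // an entirely unreachable quorum could accept writes
		}
	}
	return true
}

func main() {
	// Six tablets; a write quorum is a primary plus two ACKers, so every
	// three-tablet subset is a potential write quorum.
	tablets := []string{"t1", "t2", "t3", "t4", "t5", "t6"}
	var quorums [][]string
	for i := 0; i < len(tablets); i++ {
		for j := i + 1; j < len(tablets); j++ {
			for k := j + 1; k < len(tablets); k++ {
				quorums = append(quorums, []string{tablets[i], tablets[j], tablets[k]})
			}
		}
	}

	// 3-3 partition: this side reaches only t1..t3. Unsafe, because the
	// unreachable set {t4, t5, t6} is itself a write quorum.
	side := map[string]bool{"t1": true, "t2": true, "t3": true}
	fmt.Println(revocationIsSafe(side, quorums)) // false

	// 5-1 partition: this side reaches t1..t5. Safe, because every
	// three-tablet quorum must include at least one of them.
	big := map[string]bool{"t1": true, "t2": true, "t3": true, "t4": true, "t5": true}
	fmt.Println(revocationIsSafe(big, quorums)) // true
}
```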
If, on the other hand, the network were partitioned such that five tablets lay on one side and one tablet on the other, then the VTOrc instance on the larger side can make forward progress. It can reach five tablets, and that is enough to guarantee revocation, because the one tablet on the other side can't actually accept writes: two semi-sync ACKs are required, and we can reach enough tablets to guarantee that it never gets that many. I hope this concept is clear. Let's take a look at the demo.

The demo is in two parts. The first part uses semi-sync durability: we have a keyspace named semi_sync, and the durability policy we use is semi_sync. The way it works is that any replica can become the primary tablet, the primary requires only one semi-sync ACK to accept transactions, and any other replica can send that ACK. In our example we have four tablets: zone1-100, zone1-101, zone1-102, and zone1-103. If you look at the VTAdmin page, here is the semi_sync keyspace, running with the semi_sync durability policy, and here are our four tablets, one of which has been promoted to primary.

Now look at the quorums that can accept transactions. Because any tablet can be the primary and any other tablet can send the semi-sync ACK, essentially any two tablets together can accept a write: one of them can be the primary and the other can send the ACK. For example, zone1-100 can be the primary and zone1-101 can send the ACK, and that is enough to accept a write. So overall, the quorum set for accepting writes is: any two tablets. Now look at the quorums for revocation. If you are able to reach only tablets zone1-100 and zone1-103, that is not enough to guarantee safety, because zone1-101 and zone1-102 could accept writes with zone1-101 as the primary. But if you can reach three tablets, say zone1-100, zone1-102, and zone1-103, that is enough, because only one tablet is unreachable, zone1-101, and it can't accept writes on its own. Reaching all four tablets is enough too; reaching just one is not.

So in our example we have four tablets, one of which is the primary. If we go ahead and shut that primary down, we expect that VTOrc will be able to make forward progress and promote someone else, because it is safe to do so. There. In that much time, VTOrc promotes a different tablet, zone1-102, as the primary, and the previous primary demotes itself to a replica, which is no longer in a serving state. But if we now introduce one more primary failure, say our cluster loses its primary again, this time VTOrc won't be able to make any progress. The reason is that there are two failures now, and had those two tablets merely been network partitioned, they could have accepted writes. VTOrc can't reach them, so it cannot safely make forward progress.
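For a homogeneous policy like semi_sync, where every tablet may be the primary and every tablet may ACK, the M from earlier reduces to simple counting. A hypothetical helper:

```go
package main

import "fmt"

// minReachable is a hypothetical helper for homogeneous policies: a write
// quorum is the primary plus `acks` ACKing replicas (size acks+1), so to
// intersect every such quorum among n tablets you must reach at least
// n - (acks+1) + 1 = n - acks tablets.
func minReachable(n, acks int) int {
	return n - acks
}

func main() {
	fmt.Println(minReachable(4, 1)) // 3: the four-tablet semi_sync keyspace
	fmt.Println(minReachable(6, 2)) // 4: the earlier six-tablet example
}
```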
And this is the flexibility that VTOrc provides: because you can configure the number of semi-sync ACKs you require, it is up to you to define how much durability you want. If you set the semi-sync ACK requirement higher, say five ACKs with ten tablets running, you can handle three or four failures. So it's up to the operators to configure the policies so that they get the performance they want and the durability trade-offs they're willing to make.

The next example we're going to look at is the cross-cell durability policy. This policy is similar to the semi_sync one, apart from one difference. The way we define a cell is as a failure domain: it could be an AWS region, or it could even cut across providers; you could say that AWS is one cell and GCP is another. However you decide to do that, once we have cells, the only way the cross_cell durability policy differs from semi_sync is that the replica ACKs have to come from a tablet in a different cell. So when a primary accepts a write, the semi-sync ACK is coming from a different cell, and you're guaranteed that your write is persisted in two cells.

In this example we're going to have four tablets, three of which are in zone2 and one in zone1. The yellow lines on the slide signify who can send semi-sync ACKs to whom. The tablet in zone1 can send semi-sync ACKs to all the other three, and can receive semi-sync ACKs from all the other three, because it lives in a different cell. But zone2-200 and zone2-201, for example, are both in the same cell, zone2, so they can't send semi-sync ACKs to each other; that is what the white line signifies.

If you look at the quorums for accepting transactions, this time there are far fewer quorums than in the previous example. If zone1-300 is the primary, it can get its ACK from any of the other three tablets; but if anyone else is the primary, it always requires the ACK from zone1-300, because that's the only tablet in a different cell. Looking at the quorums for revocation: reaching zone2-200 and zone2-201 is not enough, because tablet zone2-202 can accept writes together with zone1-300. Reaching zone1-300, zone2-202, and zone2-201 is enough, because only one replica is left. And reaching just zone1-300 and zone2-201 is also enough, because if you can reach zone1-300, you can guarantee that none of the others can make forward progress: zone1-300 is part of every one of those quorum sets.

Let's take a look at this keyspace now. We have the cross_cell keyspace, and it uses the durability policy called cross_cell. Initially, one of the tablets in the second cell, zone2-202, has been promoted to primary. If we take that tablet down — okay, so VTOrc is able to promote someone else as the primary: zone1-300 gets promoted in this case. From the VTAdmin page we can also go ahead and reparent our cluster to a different primary tablet. For example, if we go over here — okay, so now zone2-200 is the primary. If we go ahead and take this primary down as well, this time too VTOrc will be able to fix the failure. The reasoning is again in the durability policy: zone1-300 and zone2-201 are enough, while zone2-200 and zone2-202, even though they are two tablets, can't send semi-sync ACKs to each other, so they can't accept writes between themselves.
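The cell rule the demo just exercised fits in a few lines. A hypothetical Go sketch (the tablet type and the quorum enumeration are illustrative, not VTOrc code):

```go
package main

import "fmt"

// Tablet is a simplified stand-in for illustration.
type Tablet struct {
	Alias string
	Cell  string
}

// canAck encodes the cross_cell rule: a replica may send semi-sync ACKs
// only if it is in a different cell than the primary.
func canAck(primary, replica Tablet) bool {
	return primary.Cell != replica.Cell
}

func main() {
	tablets := []Tablet{
		{"zone2-200", "zone2"}, {"zone2-201", "zone2"},
		{"zone2-202", "zone2"}, {"zone1-300", "zone1"},
	}
	// Enumerate the write quorums: a primary plus one eligible ACKer.
	for _, p := range tablets {
		for _, r := range tablets {
			if p != r && canAck(p, r) {
				fmt.Printf("quorum: primary=%s, acker=%s\n", p.Alias, r.Alias)
			}
		}
	}
	// Every quorum printed above includes zone1-300, which is why
	// reaching zone1-300 is the key to revocation in this layout.
}
```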
And this is where the difference between the semi_sync policy and the cross_cell policy comes in: because the failures are in the same cell, the cross_cell policy can make forward progress where semi_sync could not. In the case of the semi_sync policy, two failures were enough that VTOrc could not make forward progress; in this case, it can. It has promoted someone else, zone2-201, as the primary, and things have resumed. But any more failures than this, even just the zone2-201 tablet going down, would again result in a scenario where VTOrc can't make progress, because there would be only one tablet left, and you need one semi-sync ACK from another cell to accept any writes at all. And that concludes the demo for both of those keyspaces.

I would like to talk a little bit more about custom durability policies. The way a durability policy works, you only need to answer three questions when you implement your own custom policy; this is encoded as an interface. The first question is: who can be the primary? You can say that, of the tablets you have, only some are allowed to be primaries. The others might be read-only tablets that you keep around for OLAP queries, for example, and you don't want them to become primaries at all. The second question is: how many semi-sync ACKs does each primary require? This can be customized for each tablet that can be the primary: you could say that one candidate, tablet A, requires three semi-sync ACKs, while tablet B requires only one. That is also possible. And the final question is: given who the primary is, which tablets can send those semi-sync ACKs? By answering these three questions, you can implement your own custom durability policy based on the constraints that you have and the trade-offs that you're willing to make.
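Encoded as a Go interface, those three questions might look like the sketch below. The real interface in the Vitess codebase differs in names and detail; this only illustrates the shape, using the cross_cell behavior from the demo as the example implementation:

```go
package main

import "fmt"

// Tablet is a simplified stand-in for illustration.
type Tablet struct {
	Alias string
	Cell  string
	Type  string // e.g. "REPLICA", "RDONLY"
}

// DurabilityPolicy is a sketch of the three-question interface.
type DurabilityPolicy interface {
	// Question 1: which tablets may be promoted to primary?
	IsPromotable(t Tablet) bool
	// Question 2: how many semi-sync ACKs does this primary require?
	SemiSyncAckers(primary Tablet) int
	// Question 3: may this replica send semi-sync ACKs to this primary?
	CanSendAck(primary, replica Tablet) bool
}

// crossCellPolicy illustrates the cross_cell behavior from the demo:
// read-only tablets are never promoted, one ACK is required, and the
// ACK must come from a different cell.
type crossCellPolicy struct{}

func (crossCellPolicy) IsPromotable(t Tablet) bool        { return t.Type == "REPLICA" }
func (crossCellPolicy) SemiSyncAckers(primary Tablet) int { return 1 }
func (crossCellPolicy) CanSendAck(p, r Tablet) bool       { return p.Cell != r.Cell }

func main() {
	var policy DurabilityPolicy = crossCellPolicy{}
	p := Tablet{"zone2-201", "zone2", "REPLICA"}
	r := Tablet{"zone1-300", "zone1", "REPLICA"}
	fmt.Println(policy.IsPromotable(p), policy.SemiSyncAckers(p), policy.CanSendAck(p, r))
	// Output: true 1 true
}
```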
Apart from primary failures, VTOrc also handles many other failure scenarios that MySQL instances generally run into; it maintains the cluster's desired state. If the primary goes into a read-only state, or a replica's replication is stopped, and so on — the failures listed here, and many more, are all things VTOrc is able to handle. So if you have VTOrc running, it's a largely automated scenario where you just don't need to look at anything until something drastic happens, something that you did not prepare for or that your durability policies were not configured to handle. If that happens, then manual intervention is required. Otherwise, VTOrc is an automated process that will take care of your cluster and keep it healthy all the time. Thank you.

We're going to move on to the question-and-answer round. Before we do that, there are some resources. Tomorrow at 11:55 there is another Vitess talk, which covers sharding and new features in Vitess. The blog post series is linked here, as is the VTOrc documentation. Please leave feedback — you can use the QR code for that; we would love to hear from you as to how the talk was and how we could improve. Thank you. We also have a booth for Vitess in the solutions showcase; please feel free to stop by and talk to us. Any questions from the audience?

Yes. [The first audience question, about Kubernetes API server outages, was not captured; it is restated in the answer below.] And the second question is: I noticed you were careful not to say that M stands for a majority of the tablets. For scenarios where M is, let's say, only two replicas — less than a majority — couldn't we potentially end up in a split-brain scenario? How can one recover from, or reconcile, those scenarios?

Let me repeat the questions. We'll take the second question first, and then I'll answer the first. So the second question was about majority: the number of required tablets, M, does not have to be a majority. Yes — the way M works, it can actually vary based on the failures that you have. In the cross_cell policy, if the zone1-300 tablet fails — the only tablet in the other cell — then none of the other tablets can make progress. If we go back to that slide: if this one tablet goes down, that alone is enough to bring down the entire cluster, because it is the only tablet in a different cell. But that also means you can handle multiple failures in this cell: two tablets going down from here is what we saw in the example. So M is not even a constant for a given durability policy; it could be one, it could be two, all depending on the kind of failures you have.

The last part of your second question was what happens, and how you reconcile, if you end up in a split-brain scenario. The way VTOrc works, it will not allow you to get into a split-brain scenario; it will stop writes at that point. In the situation I described earlier — network partitioned like so, with the primary running in this half going down — VTOrc won't promote anything else, because it's not safe to proceed. It will actually wait for manual intervention, at which point the operator can decide how they want to proceed in that scenario. They could, for example, take down all three tablets on one side and just promote one from the other. So it's up to the operator to decide how they want to take things forward, but VTOrc itself will not let the cluster go into a split-brain scenario. I hope that answers your question.

So in a semi-sync scenario where you only require ACKs from two other participants, why can't we have a split-brain scenario there? Let me answer that question. The basic principle that prevents it is quorum intersection. Say you have five tablets altogether and you require two acknowledgments for the primary to accept a write. When you try to revoke leadership from the current primary, you need to form a quorum that includes at least one replica from every set that could acknowledge it. So in that example of five tablets with two ACKs, you need at least three replicas, other than the primary, to agree that the current primary is going to step down. The quorum for normal operation is different from the quorum for leader election, and the two have to intersect. That's the same quorum-intersection principle that has been published in the literature. Thank you.
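To make the arithmetic in that answer concrete, here is the same counting as a hypothetical helper: with four replicas and two required ACKs, at most one replica may remain unreachable.

```go
package main

import "fmt"

// minReplicasToRevoke is a hypothetical helper: to revoke a primary that
// needs `requiredAcks` semi-sync ACKs, at most requiredAcks-1 of its
// replicas may remain unreachable.
func minReplicasToRevoke(totalReplicas, requiredAcks int) int {
	return totalReplicas - (requiredAcks - 1)
}

func main() {
	// Five tablets: one primary, four replicas, two required ACKs.
	fmt.Println(minReplicasToRevoke(4, 2)) // 3, matching the answer above
}
```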
And the first question, which was about API server outages — yes, the question was what happens if the Kubernetes API server is unavailable. VTOrc needs to be able to communicate with the topology server, where we store state, and that doesn't necessarily depend on the Kubernetes API server. So as long as the networking in the Kubernetes cluster is functional and the topology server is reachable, we will still be able to perform an election.

Any other questions? All right, if there are no other questions, we will end here. Thank you all for coming out.