 Hi, everyone. Thank you for coming. My name is Shlomi Noach, and this is "MySQL and the CAP theorem." We're going to see whether the CAP theorem is relevant or irrelevant to production MySQL systems. I'm with the GitHub database infrastructure team. I'm assuming you know a little bit about GitHub. We do run MySQL in our backend, and clearly we're concerned with consistency and availability. We want our services to be available, right? We want you to be able to open a pull request page or an issue. Raise your hand if you've ever heard of the CAP theorem. Okay, so almost everyone. So the incentive for this session is a couple of events. One was a blog post that I published at GitHub, and the other one was an outage we had. And this sparked community discussion. People were saying things like: well, why did you design an available system? You should have designed a consistent system. Or, you know, the CAP theorem says you can't have consistency if you prefer availability, or vice versa. Basically waving the CAP theorem and saying one thing can be done and the other can't, as if it were an E = mc² kind of limitation. And I found many of these comments to be inaccurate, if not incorrect. So in this session I would like to explain what the CAP theorem actually means and how it applies to production systems. So, 20 years ago, Eric Brewer conjectured an idea, the CAP conjecture. He said a system can be available, can be consistent, can be partition tolerant, but it can't be all three; you have to pick two out of the three. That was the conjecture. A few years later, Gilbert and Lynch proved what we call the CAP theorem. Their paper was named after Brewer's CAP conjecture; Eric Brewer's name appears in the title. But, and this is misunderstood by many, they did not prove Brewer's conjecture. They proved a different conjecture. You see, they changed the terms, and then they made a mathematical proof. Now, I've studied mathematics. I'm a great fan of math. 
And in math, a theorem only holds true as long as its terms and conditions are met. You cannot just change the conditions and say, yeah, the theorem still holds. It doesn't work like that. So they used different terms. And because everybody today quotes the CAP theorem, we'll go by their terms, explain what those terms mean, and see how they apply to our systems. The terms are still consistency, availability, and partition tolerance, though the word availability lost the "high" part. And it goes like this; I'm going to read the definitions exactly. Consistency: once a write is successful on a node, any read on any other node must reflect that write or any later write. So say there's a table, there's a value, the value is 1, and I run a series of updates: I update to 2, 3, 4, 5, 6, 7, 8, 9, 10. Say I just updated to the number 7 and I got an OK, the commit was acknowledged. I go to any other node. I should see the number 7, or maybe 8 or 9 if newer writes were taking place, but I absolutely shouldn't be seeing 4, 5, or 6, right? Makes sense? Availability: that's the big change from Brewer's definition, so pay attention. Every non-crashed node must respond to requests in a finite amount of time. That is the definition in this theorem. Now, a few notes about this. There is no limit on this finite amount of time. It could be 5 seconds. It could be 5 minutes. It could be 7 years. If a request returns within 7 years, the system is available. Also, it is not explicitly declared, though otherwise this would disprove the CAP theorem, that the response must be successful. If the system returns errors all the time, then it's not available; it must actually respond to the request we sent. One last thing to note: because this is a mathematical definition, "every non-crashed node must respond to requests in a finite amount of time," what we have here is what math calls a vacuous truth, a statement that is true in an empty way. 
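The consistency definition above can be sketched as a tiny check over the talk's 1-through-10 example. This is purely illustrative; the function and names are made up for this sketch and come from nowhere in the talk:

```python
def consistent_read(history, acked_value, observed):
    """Toy model of the CAP consistency rule.

    history: the full ordered list of committed writes, e.g. [1..10].
    acked_value: the write we got an OK for (7 in the talk's example).
    observed: what a read on any other node returned.
    Consistent iff the read reflects that write or any later write.
    """
    i = history.index(acked_value)
    return observed in history[i:]

writes = list(range(1, 11))            # updates 1, 2, ..., 10
assert consistent_read(writes, 7, 7)   # seeing 7 is fine
assert consistent_read(writes, 7, 9)   # a newer write is fine too
assert not consistent_read(writes, 7, 5)  # seeing 5 violates consistency
```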
According to this definition, a database system where all the nodes are crashed is available, because only the non-crashed nodes must respond to requests in a finite amount of time. It is perfectly valid to declare such a crashed system as available according to these rules. Now, if I told you that we are interested in availability, and so we are crashing all our database servers to make the system available, you would say that this makes no sense at all. Agreed? Okay, we continue. Partition tolerance basically means that the system can operate with network failures. And I will rephrase the entire CAP theorem. Instead of doing the two-out-of-three thing, I'll define it this way: if the network is good, you can achieve both consistency and availability, no problem. If the network is bad, then you might need to choose between availability and consistency. That's what the theorem says. All right. So, we're going to prove the CAP theorem right here and now. This is literally the proof from the paper. It's absolutely not complicated, as you will shortly agree with me. The proof uses two nodes, N1 and N2. They're connected, they're replicating from each other, they're happy, everything's fine. We place a network partition between them. We put up a barrier, and this barrier is held for an infinite amount of time. It will never come back up. We assume that the system is available, according to the definition of availability we have just described. So, there are some rows, there's some data. The value is 3 here and 3 here, and the system is consistent. We now write the value 7 on node number one. Because we said the system is available, that write request will return in a finite amount of time, whatever time that is. But since it's finite, it's valid to speak about what happens afterwards. 
And what happens afterwards is that we read the value from node number two. And that, too, since we assume the system is available, is going to complete within a finite amount of time. It is therefore valid to speak about what happens next. And what happens next is that a finite amount of time has passed. We placed the value 7 on node number one, and there is no way it reached node number two, because throughout this time the network was broken. QED: we assumed availability, therefore there cannot be consistency. Does that make sense? That's the proof. It's a mathematical proof, and that's basically it; you know the CAP theorem now. And as a mathematical proof, it is excellent. That's the way we prove things in mathematics, right? You want to prove that something is not possible, you show one counterexample where it breaks. Otherwise it might be completely possible, but if there is one case where it breaks, then it is not possible. It is not possible to achieve availability and consistency together, because there is that single case which breaks it. But we as engineers understand that things are not black and white. For example, let's talk about real high availability, the way we understand it as engineers. We already agree that there's no such thing as 100% availability. We talk about five nines and four nines and whatever makes us happy in production, right? We acknowledge that there could be single cases that we might not meet. If I were to tell you that in the next five years there will be at least three seconds of outage in your services, you would believe me, but you wouldn't be shocked, right? We understand that, and we understand that we can make trade-offs sometimes. One last thing about CAP itself: it does not discuss a myriad of other topics which we as engineers are very interested in, such as scalability, uptime, latencies, durability, and so on. With that said, let's move on and see how CAP, as we defined it, applies to production MySQL systems. 
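The two-node proof can be mimicked in a few lines of code. This is only a toy model of the argument; the class and names are invented for this sketch and are not from the paper:

```python
# Sketch of the Gilbert–Lynch argument: two nodes, a permanent partition,
# a write that succeeds on one side, and a read on the other side.
class Node:
    def __init__(self, name, value):
        self.name, self.value, self.peers = name, value, []

    def write(self, value, partitioned):
        self.value = value              # returns in finite time: "available"
        if not partitioned:             # replication cannot cross the partition
            for peer in self.peers:
                peer.value = value
        return "OK"

    def read(self):
        return self.value               # also returns in finite time

n1, n2 = Node("n1", 3), Node("n2", 3)   # consistent to begin with
n1.peers, n2.peers = [n2], [n1]

n1.write(7, partitioned=True)           # the write succeeds on n1
stale = n2.read()                       # the read on n2 also returns
# stale is 3, not 7: both requests were "available", consistency was not.
```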
So we begin with the simple MySQL replication cluster. That's asynchronous replication: a single master, multiple replicas. Is this an AP or a CP system? What do you think? Don't be shy. AP or CP? There's no time. Faster, faster. "It's CP." Okay, so clearly we all agree that this cannot be a CP system, right? This is asynchronous replication. There could be replication lag; nothing guarantees that the data will be there. We know that on a daily basis. So clearly this cannot be a consistent system. I write here, I read there, and it's not the same value. Surprise. So we know it's not a CP system. Is it an AP system? Well, my question whether this was AP or CP was really not a good question, because it is neither. The fact that the system is not AP doesn't make it CP, and the fact that it is not CP doesn't make it AP. Why is it not an AP system? Well, this cluster doesn't really fit the definition, or the state of mind, of the CAP theorem. According to the CAP theorem, any node must accept requests in order for the system to be available, and must respond to any of our requests in a finite amount of time and without error. In this system, there is only a single node that can respond to write requests. All the other nodes do not even play that game; replication is unidirectional. But the CAP theorem doesn't make these distinctions. It needs all nodes to be able to participate. And if this one node goes down, no one is able to serve writes. CAP is not interested in failovers and in what happens five seconds after I fail over. According to CAP, this is not an available system. It is not consistent, and it is not available. But we know how to get consistency, right? We can always read from the master. That's our way. Everybody knows that: you write to the master, and when you want consistency, you read back from the master. We know that it doesn't scale. But do we really care that not all replicas have the data at that exact time? We have something. We have some solution. 
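The "read from the master when you need consistency" rule described above amounts to a one-line routing decision. Here's a loose sketch; the function and names are invented for illustration and are not any actual proxy's API:

```python
import random

def route(needs_consistency, master, replicas):
    """Pick a backend for a read query.

    Reads that must observe the latest write go to the master, which always
    has the most recent data; everything else is spread across replicas,
    which may lag. This is the trade-off: consistency vs. read scale-out.
    """
    if needs_consistency or not replicas:
        return master
    return random.choice(replicas)

# read-your-writes: route to the master
assert route(True, "master-1", ["replica-1", "replica-2"]) == "master-1"
# any other read: offload to some replica
assert route(False, "master-1", ["replica-1", "replica-2"]) in {"replica-1", "replica-2"}
```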
But it doesn't scale. So we can use ProxySQL with causal GTID reads, right? This works for same-connection queries, not for all queries. It is not CAP-consistent, and it is not fully consistent in the production sense, but we can mitigate a large part of the problem: we can offload reads from the master. At GitHub we also use freno to heuristically suggest that all replicas are up to date, so we can offload queries from the master. Again, this is not CAP-consistent, but we can achieve better consistency with less impact on the master. Let's move on to semi-sync. With semi-sync, I write to the master, and the write is not acknowledged until the binlog event has been shipped to one or more replicas, depending on configuration. Is this a more consistent system? Is this a consistent system now? It is not, because the only thing semi-sync guarantees is durability of my data physically on another node. Nothing promises that the change has been applied on that node. I still have replication lag, same as before. So this system is no more consistent. Is it more available? It is not, because there's still a single master; it still doesn't fit the model. Is it even less available? Possibly, depending on configuration. If you put an infinite timeout on semi-sync and all the replicas are isolated, that query on the master might never return, which makes the system not available, or less available, depending on how long you wait. At GitHub, we have a hybrid setup: local-DC replicas use semi-sync with a master wait timeout of 500 milliseconds, and remote-DC replicas don't use semi-sync at all. Is my system AP or CP? It is neither, right? It's not CP, but is it more AP or more CP, if we try to approximate? Well, this really depends on when and where you look. In this DC, it leans more CP. In that DC, it's a little more AP. It depends on the time of day and on the workload. 
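The semi-sync timeout behavior described above (the commit waits for a replica ACK, and after the configured timeout falls back to asynchronous replication so the write still returns) can be sketched roughly like this. The function and names are illustrative only, not MySQL internals:

```python
import queue

def commit(ack_queue, timeout_ms):
    """Toy model of a semi-sync commit on the master.

    Wait for a replica acknowledgment for up to timeout_ms milliseconds.
    If one arrives, the write is durable on at least one more node.
    If not, fall back to async so the client still gets a reply:
    availability is preserved, the extra durability guarantee is not.
    """
    try:
        ack_queue.get(timeout=timeout_ms / 1000.0)
        return "acked-by-replica"
    except queue.Empty:
        return "fell-back-to-async"

acks = queue.Queue()
# all replicas isolated: no ACK ever arrives, commit returns after the timeout
assert commit(acks, timeout_ms=50) == "fell-back-to-async"
acks.put("replica-1")                  # a replica ACK is waiting
assert commit(acks, timeout_ms=500) == "acked-by-replica"
```

Note that with an infinite timeout the `get` would block forever when replicas are isolated, which is exactly the "less available" scenario the talk mentions.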
You cannot classify this system by a single attribute, right? You cannot say "this is an AP system." It's a hybrid system. Take Vitess. Vitess runs on top of normal MySQL clusters, and it has this SQL interface, vtgate, which acts like a proxy. It is stateless. You send it queries, and it breaks those queries apart and sends them to the appropriate shards. Is that an AP system? A CP system? It is neither, because it runs on top of normal MySQL clusters and hence has the exact same properties. But if you set up semi-sync, and you send read traffic only to the master, and you have a reasonable failover mechanism, then you can get a very highly available, very highly consistent system, one that can scale, because the load on the master is no longer an issue; you can always reshard. So you can get much of all the good things together, despite this system being, CAP-wise, neither available nor consistent. And that's the dirty little secret of CAP: it does not have the power of E = mc². Imagine NASA wanted to send a spaceship that goes beyond the speed of light. Imagine some brilliant engineer stepping up and saying: I know, I'm going to make a trade-off. I'm going to build a spaceship that runs at 10 times the speed of light 99% of the time, and the remaining 1% of the time it runs at only half the speed of light. That can't be done, because of E = mc². You can't do that, sorry. But with the CAP theorem, we are able to work around the limitations and come up with a highly consistent, highly available system. And do we really need more than that? There's more. Let's talk about Group Replication, in MySQL 8.0.14, and thank you, Oracle, for changing my slides at the last moment. So, a single writer in the cluster. Now we have Paxos, right? There's consensus. There's interesting stuff. Is this an available system? Is this a consistent system? Who says it's available? Who says it's consistent? It is neither available nor consistent, people. So, availability. 
Why is this not an available system? Because I could partition the network so badly that there would be no quorum. If there's no quorum, there's no leader. If there's no leader, there's no writer, and no writes can be accepted. So we can make this system unavailable with a network partition. Is it consistent? It is not, because Group Replication, much like semi-sync, only guarantees that the data becomes durable on another node, not necessarily applied. Of course, you can put a proxy in front and route all queries to the leader and get very, very good consistency, but CAP-wise, it's neither. With multiple writer nodes, you still get the same not-available, not-consistent system. And it's possibly even less available, because now nodes may refuse your requests because of conflicts, and that's not good with CAP. So it's even less available if you have more writers. Does that make sense? Less and less. Galera. So, Galera is just as unavailable as Group Replication, because I could partition the network so badly that there's no quorum, no leader, no writes. But it introduces the wsrep_sync_wait variable, which we can set to a non-zero value, let's say 3, which means I can block reads so that they wait until preceding writes have been applied on the local node. That sounds like a consistent system, and really, in practice, it's wonderful. But according to CAP, this is not a consistent system, because a single node can be isolated, and when it's isolated, it takes a timeout for that node to recognize that it's isolated. During that time, it will serve inconsistent reads; it will serve stale data. It could be five seconds, it could be nothing, right? And according to the CAP theorem, this is mathematics: we just proved that something can't be done using an infinite network partition, right? So here's another such case. This is not a consistent system. 
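A wsrep_sync_wait-style read can be modeled as "wait until the local node has applied everything up to the cluster's committed position, or give up after a timeout." This is a loose illustration with invented names, not Galera's actual implementation:

```python
import time

def sync_read(node_applied_seqno, cluster_seqno, apply_fn, timeout_s=1.0):
    """Toy model of a causality-checked read.

    Before serving the read, apply pending write-sets until the local node
    has caught up to the cluster's committed sequence number. If it cannot
    catch up in time (e.g. the node is isolated), fail rather than serve
    stale data.
    """
    deadline = time.monotonic() + timeout_s
    seqno = node_applied_seqno
    while seqno < cluster_seqno:
        if time.monotonic() > deadline:
            raise TimeoutError("node cannot catch up; read would be stale")
        seqno = apply_fn(seqno)        # apply the next pending write-set
    return seqno                       # now safe to serve the read

# node is 2 write-sets behind; applying them catches it up before the read
assert sync_read(5, 7, lambda s: s + 1) == 7
```

The isolation window the talk describes is the case where `apply_fn` makes no progress: the read blocks until the timeout instead of silently returning stale data.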
InnoDB Cluster. Congratulations, Oracle: released ten days ago, a week ago, whatever. 8.0.14, with really interesting stuff. InnoDB Cluster can now serve consistent reads. Not CAP-consistent, I'm sorry. It is not an available system because of the same network partitioning problem. And much like the Galera problem, if I isolate a single node, there will be a timeout, some small amount of time. Again, in production, I'm really happy with this. But CAP-theorem-wise, this is not a consistent system; it will serve stale reads. Now, with InnoDB Cluster, we have an extra option that says what happens to that isolated server. Does it just turn read-only and refuse queries? Or does it shut itself down? It can kill itself, commit suicide, once it realizes it is not part of the quorum, not part of the cluster. That doesn't make it more consistent, but killing the node makes it more available. So it's a bit ironic that by killing a node, which makes perfect sense, by the way, we make the system more available according to the CAP theorem. Sorry? Okay. All right, I have to move on; I'll take this later. So far, we haven't seen a single AP system or a single CP system. Is there an AP system? Yes, there is. Take six masters, not connected to each other in any way, not replicating, not doing anything between each other. That is an AP system. Of course, there's zero consistency, so this is not interesting. Circular master-master replication is also an AP system, but we all know that in production we are moving away from this setup. This is not something that a lot of companies have; the industry has moved away from master-master replication. So the AP systems are not very compelling. And so, on to the takeaways, as I'm running out of time. I'm not sure CAP is the right model for us. I just took all these setups and said: not available, not consistent, not available, not consistent. Is that a bad thing for those systems? 
I don't think so. I think these systems are excellent. We are able to achieve most of what we want from them. Maybe the problem is that we try to enforce the CAP theorem on something that does not play well with the CAP theorem. So do we necessarily want to talk about the CAP theorem at all? The definition of availability according to CAP is not something I feel comfortable with; it doesn't make sense to me. And so, please be wary of people waving the CAP theorem and saying you can't achieve this because of that. Yes, there is a trade-off between availability and consistency, but it's not the one everyone thinks it is; it's a different trade-off. There's an extension to CAP, which you can read about later. There's a lot of information on the internet about the CAP theorem, and I found many misleading or incorrect articles, so I'm pointing here to quite a few links that are, I believe, correct. And thank you.