This afternoon, we're going to talk about group replication: a journey to the group communication core. The idea is to show what's behind group replication, and to talk a little bit about Paxos and how our Paxos-based implementation works. I think you already saw this slide, but I'll leave it here for a while.

I'll talk a little bit about myself. I joined Oracle, at the time it was still Sun, eight years ago. I've been working basically on replication and HA for the last eight and a half years. Before that, I worked on databases as a consultant, basically on Postgres, Oracle, and Microsoft SQL Server, and before joining Sun and Oracle I had never worked on MySQL. It has been a pleasure to work with a lot of great guys and to develop and improve replication for the last eight years.

So, our agenda. The idea is to give a little bit of background on what the plans are regarding replication, and specifically group replication. Then we're going to dive into what's behind group replication, specifically our Paxos-based implementation. And then I'll show a little bit about performance, but just a sneak peek, because after me there will be a guy named Vitor who will dive into performance and explain everything that you can and cannot do with group replication in terms of performance.

So group replication is a tiny but important piece within MySQL InnoDB Cluster. The idea of MySQL InnoDB Cluster is to be a fully distributed database solution. We'll have the router, which is responsible for delivering or sending queries to a specific replica set. A replica set is powered by group replication. And, as you can do now with asynchronous replication, you can plug asynchronous replication into our group replication core here. Our group replication solution is often called synchronous replication, but I'll come back to that, because I really don't like the name: it's not really a synchronous solution. I'll talk about that in a few minutes.

So although it's fairly new, you can still use all the MySQL ecosystem solutions. You can use asynchronous replication and plug it into our group replication solution. You can use semi-synchronous replication. So you can still grow your cluster using asynchronous replication plugged in alongside the group replication solution.

There is also another important piece of this puzzle, called MySQL Shell, which is responsible for orchestrating our cluster. You can use MySQL Shell in a user-friendly way to add nodes to the cluster, remove nodes from it, and clone a cluster. Basically it's an admin tool to make things easier for the end user. And of course, in order to reach one of these replica sets, we need to keep track of the members that have joined the cluster and have left the cluster, so we keep metadata on this information. This is our future: it's not fully implemented yet, and you can see a sneak peek of the solution in labs releases, but we are working hard to eventually make this solution GA.

So at the core of MySQL InnoDB Cluster there is MySQL Group Replication. It's a multi-master, update-everywhere replication plugin for MySQL with built-in automatic distributed recovery, conflict detection, and group membership.
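Going back to MySQL Shell for a moment, here is a minimal sketch of what that orchestration looks like in MySQL Shell's Python mode (`mysqlsh --py`). The host names and the `clusteradmin` account are made-up examples, and the exact options available vary by version:

```python
# Run inside MySQL Shell in Python mode: mysqlsh --py
# Connect to the instance that will seed the cluster
# (the URI below is just an example).
shell.connect('clusteradmin@node1.example.com:3306')

# Create a new InnoDB Cluster on top of group replication.
cluster = dba.create_cluster('myCluster')

# Add two more instances; the Shell configures group replication
# on them and distributed recovery brings them up to date.
cluster.add_instance('clusteradmin@node2.example.com:3306')
cluster.add_instance('clusteradmin@node3.example.com:3306')

# Check the membership and the current primary.
print(cluster.status())
```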
And the key point of group replication is that it automates failover. You don't need a script. Normally, when a master fails, you have to fail over to a new master and pick the slave that has the best GTID set. Everything is fully automated when you use our group replication solution; you don't need to think about it. It also provides multi-master, update-everywhere replication. You can write to any master, although we don't recommend this yet. There are still some rough edges between our group replication solution and InnoDB, specifically around gap locks; we are actually working on making gap locks work in sync with our group replication solution. But if you control your workloads, if you have ideas about how to split your workloads between the servers, multi-master will work just fine.

And it provides fault tolerance. When a node fails, the cluster will automatically detect that the node has failed, and if you are using the single-primary solution, it will elect a new primary automatically. So it's about automation. It will hide all the complexity of failover from you, so you don't need scripts to do a lot of things, and having seen some talks today, scripts can do bad things when you combine them with users.

So what's in group replication? There are three important pieces. There is the replication plugin, which is the core piece of the solution. There is a set of APIs to talk to the server: everything that group replication needs from the server is provided through a set of APIs. And there is an API underneath it that is used to propagate information, in this case transactions, to the remote nodes.

How does it work? During the execution of a transaction, there is no interaction among the nodes; everything is done locally. But upon commit, we gather everything the transaction has done and propagate it to all the other members. And then the magic happens. What exactly happens? When the changes propagate to the remote nodes, there is a certification process that is responsible for verifying whether there are conflicts among concurrent transactions that have been executed on other nodes. If there is a conflict, the transaction will be rolled back. If there is no conflict, the transaction will commit. The remote transaction is committed through what we call an applier. The applier gets all the updates, which we call a write set, and we inject and apply these updates on the remote nodes. On the originating machine, the machine where the transaction executed, we just need to reply back to the end user saying, OK, the transaction has been committed, or the transaction has been aborted.

And everything is integrated in MySQL. It's integrated with InnoDB. Of course, this is our first GA release; it has been GA since December, and there are a lot of things still to be done. But the idea is to provide a lot of performance schema tables so that you can monitor the solution. There are a few performance schema tables already provided that you can use to check what's happening inside the cluster and what's happening within a specific node.

So underneath there is this group communication API that's responsible for propagating the information to the remote nodes, and we're going to dive a little into this API. This API is quite simple. It's a thin layer, and the idea is to hide all the implementation details of the group communication system itself.
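Going back to the certification process for a moment, here is a toy sketch of first-committer-wins conflict detection. This is my illustration, not the plugin's actual code: I model the write set as a list of row keys and the snapshot as a plain integer version, where the real implementation works with hashes of the modified rows and GTID sets. The `Certifier` class and its names are invented for the example:

```python
# Toy certifier: maps each row key to the logical version
# of the last transaction that modified it.
class Certifier:
    def __init__(self):
        self.last_version = {}   # row key -> version that last wrote it
        self.next_version = 1

    def certify(self, snapshot_version, write_set):
        """Certify a transaction delivered in total order.

        snapshot_version: version the transaction read from.
        write_set: keys of the rows it modified.
        Returns the commit version, or None if it must roll back.
        """
        for key in write_set:
            # Conflict: someone committed a newer version of this row
            # after our snapshot was taken -> first committer wins.
            if self.last_version.get(key, 0) > snapshot_version:
                return None
        commit_version = self.next_version
        self.next_version += 1
        for key in write_set:
            self.last_version[key] = commit_version
        return commit_version

c = Certifier()
t1 = c.certify(snapshot_version=0, write_set=["t.pk=1"])  # commits
t2 = c.certify(snapshot_version=0, write_set=["t.pk=1"])  # conflicts
print(t1, t2)  # 1 None: the second writer is rolled back
```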
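And since I mentioned the performance schema tables: `performance_schema.replication_group_members` is one of the tables that already exists, and you can query it from any client. A small sketch using the MySQL Connector/Python driver; the connection parameters are placeholders:

```python
import mysql.connector

# Connect to any member of the group (credentials are placeholders).
conn = mysql.connector.connect(host="node1.example.com",
                               user="monitor", password="secret")
cur = conn.cursor()

# One row per member, with its current state
# (ONLINE, RECOVERING, UNREACHABLE, ...).
cur.execute("""
    SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
    FROM performance_schema.replication_group_members
""")
for host, port, state in cur:
    print(f"{host}:{port} -> {state}")

conn.close()
```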
In the beginning, our solution was based on Corosync. It worked fine as a prototype, but we had several issues with Corosync, and I'll highlight some of the main issues that led us to choose another path and develop an in-house solution. This thin API, which sits between the core of the group communication implementation and the message bus, was really important to make this transition easy in our case. So in the beginning there was Corosync, but we decided to move to our in-house implementation, and I will talk a little bit about it.

So it's a thin layer, but it has to preserve some important properties of the group communication system, whatever that system is. It has to provide total order, in the sense that when you send a message to remote nodes, it has to be totally ordered among all the messages that are being sent by concurrent nodes. It has to preserve safe delivery, which means that a node is only allowed to deliver a message when a majority of the nodes have received it. This was one of the key issues that we had with Corosync. Corosync didn't provide safe delivery, which means that a node could get a message, do something based on that message, and fail, while the other nodes never got it. Basically, that means you could end up with an inconsistent cluster. And that's why we decided to use our in-house implementation.

Another important point about the group communication system, which this thin layer has to preserve, is the idea of view synchrony. Basically, view synchrony means that whenever there is a change in your cluster, whenever a node is added or removed, the information about the membership has to be totally ordered with respect to the messages that are sent through the cluster. So it's quite important: deciding when a node joins and when a node leaves has to be totally ordered with the messages that are exchanged in the cluster. It's a key property of a group communication system. And this view synchrony property is usually called virtual synchrony.

You may have read somewhere about group communication, about synchronous replication, about Galera, and they use "virtual" in the name, as in "virtually synchronous". In my humble opinion, that's not really a good idea, because the "virtual" comes from virtual synchrony, which basically means that information about nodes that have left the cluster or joined the cluster, information about changes in the cluster, has to be totally ordered with the messages that are being exchanged. It has nothing to do with being synchronous. There is no correlation between synchronous replication and virtual synchrony. I don't know why people came up with this name, but it's wrong. And our solution is not really synchronous; it's not synchronous replication. As I said before, I'll explain why it's not synchronous in a few slides.

So, our group communication engine. Our solution is based on Paxos. It was initially based on Corosync, but we decided to use our in-house Paxos implementation. It has nice features. It has built-in compression. It has the ability to run on different platforms, basically all the platforms that are supported by MySQL, whereas Corosync only supported Linux. So Corosync was basically a no-go for us in that sense.
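Back on the safe-delivery property for a moment, here is a toy sketch of the rule I described, under my own simplified model: a message is handed to the application only once a majority of the group has acknowledged receiving it. This is illustrative code, not the actual engine:

```python
class SafeDeliveryBuffer:
    """Buffer that delivers a message only after a majority ack."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.majority = group_size // 2 + 1
        self.acks = {}        # message id -> set of members that got it
        self.delivered = set()

    def on_ack(self, msg_id, member):
        """Record that `member` received `msg_id`; deliver on majority."""
        self.acks.setdefault(msg_id, set()).add(member)
        if (msg_id not in self.delivered
                and len(self.acks[msg_id]) >= self.majority):
            self.delivered.add(msg_id)
            print(f"deliver {msg_id}")  # safe: a majority has the message

buf = SafeDeliveryBuffer(group_size=3)
buf.on_ack("m1", member=0)   # 1 of 3: not yet safe to act on
buf.on_ack("m1", member=1)   # 2 of 3: majority -> deliver m1
```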
It has support for dynamic membership, in the sense that you can change the cluster: you can add nodes to it and remove nodes from it. It has support for SSL, so if you need encryption, you can use SSL. And I forgot to mention something; let me go back one slide, sorry. I forgot to mention closed groups. A closed group means that a node is only allowed to send a message to the group, to the cluster, if it has joined the group. So it's basically about security. Security is very important to Oracle, and it's very important to MySQL; that's why we have this closed-group property, and that's why we have support for SSL. We also have an IP whitelist, so you can specify which nodes are allowed to join your cluster.

There is no third-party software required. You don't need an external component to run group replication; you don't need an external process. Everything is tightly integrated within the MySQL product. And we don't rely on multicast, which means you can run group replication in the cloud without problems. Most of the time you really don't need multicast; I think Vitor will talk a little bit about this in his presentation about performance.

So we decided to use Paxos, but there are a lot of Paxos variations: Multi-Paxos, Fast Paxos, Disk Paxos. And we picked one specific solution. In fact, we started developing our in-house implementation, and only later came across a paper, a protocol, called Mencius. Our solution is pretty similar to Mencius, but it was not initially based on it. There are a few differences between our implementation and Mencius, but if you want to pick one protocol, one paper, that has a lot of similarities to our solution, I would say it's Mencius.

So to understand how Mencius works and how our solution works, you need to understand a little bit about Paxos. Paxos is really, really simple. Although the implementation is really tricky, and getting it right with good performance is not easy, the idea is quite simple. We have nodes that play roles in the protocol. These roles are: proposers, acceptors, and learners. There are phases, and I'll explain each phase individually. Usually all the members, all the nodes, have all the same roles: they are all proposers, acceptors, and learners. The messages that are sent through the protocol, in our case, are transactions. And in order to make progress, you need a majority. So if you have three members, and you need at least three members, you can tolerate one failure: one of the members in these slides can fail, and you can still make progress because there will be two nodes left.

So I'll explain a little bit about each one of these phases so that you can understand our Paxos-based implementation. The first phase is the prepare phase, or leader election phase, where one of the nodes, one of the proposers, will become the leader. The idea here is that if there are many proposers competing, there is no progress. If several people are bossing around, there is no progress, as in real life. So we have to pick one of these members, one of these nodes, as the leader. That's the first phase of the Paxos protocol: basically, a node sends a message saying, I want to become the leader.
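Here is a small sketch of the promise rule that I'm about to walk through, in my own simplified form: an acceptor promises to follow a ballot only if it has not already promised a higher one, and it reports any value it has already accepted so that a new leader must re-propose it. The `Acceptor` class is invented for the example:

```python
class Acceptor:
    """Phase 1 (prepare/promise) of single-decree Paxos, simplified."""

    def __init__(self):
        self.promised_ballot = -1   # highest ballot we promised to honor
        self.accepted_ballot = -1   # ballot of the value we accepted, if any
        self.accepted_value = None

    def on_prepare(self, ballot):
        """Handle 'I want to become leader with ballot N'."""
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            # Promise granted; report any previously accepted value so
            # the new leader must re-propose it instead of a new one.
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised_ballot, None)

a = Acceptor()
print(a.on_prepare(1))   # ('promise', -1, None)
print(a.on_prepare(0))   # ('reject', 1, None) -- lower ballot loses
```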
One of the nodes sends this message, usually the node with the lowest number; in our example, member zero. I use the same set of images through all these slides: the dark blue, or dark gray, whatever it is, node is the leader. The other ones are basically acceptors or learners; they can become the leader eventually if the current leader fails. So the member that wants to become the leader sends a message with a ballot number, basically a token, saying: I have this token, number N, and I want to become the leader. Everybody that gets this message will reply back if it hasn't already received a message with a higher number. It's basically a promise: OK, from now on you are going to be the leader and I will reply to you, unless somebody else sends another message with a ballot number higher than yours. Everybody makes a promise to follow that leader.

So you only need the first phase to elect a leader. As long as there is a leader, you don't run the first phase; you always go straight to the second phase. The second phase is the core of the protocol: the dissemination of the data, of the transactions, to all the members. In the previous phase, the leader got information from the members, and it will also check whether someone already accepted a previous value. If there was a previous value, it has to reuse it; otherwise it will propose whatever it has in its queue. So it will send a new value if nothing else happened. And the protocol requires just a single round trip to get an agreement, basically to say: OK, this is the next transaction that we will commit in our group replication solution.

The last phase is the learn phase. We got an agreement: the next transaction will be transaction X. Then the leader, having reached the agreement, needs to inform the other members that there was an agreement. So there is a third phase, the learn phase, where the leader informs all the members that something has been decided. And usually you can piggyback the learn message on the messages of the previous phase, the accept phase, so that you don't really need to send a specific message saying, OK, that's the next message that has been decided. You can basically piggyback this information on the previous phase of the protocol.

Time's up. OK, really? OK, so sorry about that. So, you don't really want to decide just one single message. All the messages are kept in a queue, so we have a sequence of messages: we're going to commit the next transaction, then the next transaction, then the next transaction. If a leader fails, we need to elect a new leader. So why have we decided to create our own Paxos-based implementation? Because the leader can become a bottleneck. We don't want to send every transaction to the leader and then have the leader send the transaction to all the nodes; that's really not good. That's why we decided to create our own Paxos-based implementation. In our solution, everybody is a leader: everybody commits their own transactions in their own slots. Of course, we have to organize these slots in a sequence. So let me skip a few slides.
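To tie the second and third phases together, here is a toy accept-plus-learn round, a sketch under my own simplified model where each acceptor is reduced to a dict holding the ballot it promised in phase one:

```python
# Toy accept (phase 2) + learn (phase 3) round.

def run_accept_round(ballot, value, acceptors):
    """Leader sends accept(ballot, value); returns True once decided."""
    accepted = 0
    for acc in acceptors:
        # Accept unless we already promised a higher ballot to someone.
        if ballot >= acc["promised"]:
            acc["accepted_value"] = value
            accepted += 1
    # Learn, piggybacked on the acks: decided once a majority accepted.
    return accepted >= len(acceptors) // 2 + 1

# Three acceptors that all promised ballot 1 in phase 1.
acceptors = [{"promised": 1, "accepted_value": None} for _ in range(3)]
print(run_accept_round(1, "trx X", acceptors))  # True: one round trip
```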
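And here is a toy sketch of this everybody-is-a-leader arrangement, under my own simplified Mencius-style model: slots are assigned round-robin, and, anticipating the issue I'm about to mention, an owner with an empty queue fills its slot with an explicit skip so the global sequence stays contiguous:

```python
# Toy Mencius-style sequencer: slot i is owned by node i % N.
# A node with nothing to propose puts a SKIP in its slot so the
# global sequence stays contiguous and delivery can continue.

SKIP = "<skip>"

def fill_slots(num_slots, queues, num_nodes):
    """Assign each slot to its owner; idle owners emit SKIP."""
    sequence = []
    for slot in range(num_slots):
        owner = slot % num_nodes
        if queues[owner]:
            sequence.append((slot, owner, queues[owner].pop(0)))
        else:
            sequence.append((slot, owner, SKIP))  # explicit skip message
    return sequence

# Node 0 and node 2 have transactions; node 1 is idle.
queues = [["trx A", "trx B"], [], ["trx C"]]
for slot, owner, msg in fill_slots(6, queues, num_nodes=3):
    print(f"slot {slot} (node {owner}): {msg}")
# Delivery order: trx A, <skip>, trx C, trx B, <skip>, <skip>
```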
So there is a key issue here in our solution. If a node has nothing to propose, this node has to inform the other nodes: OK, I don't have anything to propose, so I'm going to skip my turn in these slots, in the sequence of messages. You can use a simple learn message to do that, but you still have to inform the other members that you don't want to commit anything, that you don't have any transaction in the queue.

So let's move to this slide. Currently, we recommend only the single-primary solution, and there is no direct mapping between the single-primary solution and our Paxos-based implementation, in the sense that if a node fails, or if a node has nothing to say, nothing to commit, it still needs to send a learn message, and this can become a problem. So in our layer, in our thin layer, we immediately expel a node when it fails. Otherwise, every time we would have to run the full Paxos protocol to get an agreement for that node's slots.

A few optimizations. Of course, you don't need to commit transactions strictly one after another, sequentially. We have pipelining: you can have several transactions in flight, and you can run the agreement protocol, the consensus protocol, in parallel. We have batching; this is not really exposed to the end user right now, but we have plans to do that. And we have compression.

And just to finish, some numbers. With multiple writers, we are able to reach more than 100,000 transactions per second. That's pretty good. And this is key information, so that when Vitor presents numbers about group replication, you are going to see that we have room to accommodate the transactions that come from MySQL: our group communication solution has room for growth. Latency... let me skip this one. OK, sorry. Let me jump to the conclusion.

So group replication, our group communication solution, became GA in December. There are a lot of things still to be done. Basically, we decided to use Paxos to be able to play with the roles of proposers, acceptors, and learners, and we have big ideas and big plans for the implementation. So stay tuned, because it's gonna rock. Thank you, and sorry for running over time.