So, my name is Vitor Oliveira. I've been working with Group Replication, optimizing Group Replication for performance, basically, for some time now. What I'll be doing today is presenting some options and some ways that you can optimize Group Replication for performance. So, I'll skip this slide, you have seen it, and this one also goes. I'll try to explain how things work rather than just showing benchmarks, because we all know that benchmarks have many purposes, but for a production user they actually mean very little: they need to run with their own workload. Benchmarks are just metrics you can use to try to understand what's happening; the real benchmark is the workload you need to run. So, instead of that, I'll try to explain how it works behind the scenes and which things impact performance, so that each person can check what their own workload is doing.

Okay. So, let me start with this anatomy of Group Replication. In Group Replication we have basically two big parts. First there is the transaction: it executes on a node, it is prepared, and it's ready to commit. Group Replication enters at this point, when the transaction is ready to commit. It gathers the write-set information, and then it reaches what I call a throttling point. I'm going directly to these details because I think people here prefer this lower-level view. At the throttling point we apply flow-control delays if needed, and then we send the message out for ordering in Group Replication, as Alfredo was showing. At this point the thread just waits. It just waits there for something to happen, and that something is the transaction going out to the network, being ordered in agreement with all nodes, and then getting back to the certification step on that same node. That's what's happening here. Once the certification result is in, the thread can continue.
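The local path just described (prepare, gather the write set, pass the throttling point, broadcast for ordering, block until the certification verdict returns) can be sketched roughly like this. All names here are illustrative, not the actual server internals:

```python
import hashlib

def gather_write_set(modified_rows):
    # One hash per modified row (table name + primary key), as in the
    # write-set extraction step described above. Simplified sketch.
    return {hashlib.sha1(f"{table}:{pk}".encode()).hexdigest()
            for table, pk in modified_rows}

def local_commit_path(transaction, throttle, broadcast, wait_outcome):
    # Sketch of the local path: the transaction is already prepared when
    # Group Replication intercepts it at the point of commit.
    write_set = gather_write_set(transaction["modified_rows"])
    throttle()                                 # flow-control delay, if needed
    broadcast(transaction["gtid"], write_set)  # ordered delivery to the group
    # The thread blocks here until the certification outcome comes back.
    return "commit" if wait_outcome() == "no-conflict" else "rollback"
```

The point to notice is that the committing thread does almost no work itself: it hashes the write set, possibly sleeps, sends one message, and waits.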
And at that point, the certification result is either positive or negative: either there is a conflict or there is no conflict and we can go on. Then we decide to commit or roll back. This is basically the main entry point for Group Replication. Which also means that if someone traces what the server is doing, asking "why is Group Replication slow?", this is not where you'll see much, because the thread just sends the message to the network and then waits for something to happen outside.

The second part is the receiving side, where we have a loop that handles the data coming in from the network, already ordered. We receive the transactions in total order and take them one by one. Then we certify: according to the state of each node, the gtid_executed on each node, and the write-set information that comes from the network, we decide whether the transaction is compatible and can proceed, or whether it has to abort. This is deterministic, because all the nodes receive the transactions in the same order, so the decision is the same on all nodes. This is what makes the multi-master system work.

But there is a split here: if the transaction is local, we don't have to do anything else; we just say fine, commit, and go on. If the transaction is not local and the certification is positive, we need to store that transaction to be applied. We don't execute the transaction immediately. We just reuse what's already available in asynchronous replication, the slave applier: we put the transaction in a relay log and eventually it will be applied. Those are the two main areas of Group Replication, which also means that these are the factors affecting our performance the most. The obvious ones, of course: network bandwidth and latency.
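The deterministic certification loop can be sketched as follows. The conflict rule and the data structures are simplified assumptions for illustration; the real certifier also handles garbage collection and local/remote bookkeeping:

```python
def certify(write_set, snapshot_gtids, cert_info):
    # cert_info maps each row hash to the set of GTIDs that last modified it.
    # A transaction passes only if its snapshot (gtid_executed when it ran)
    # already contains every GTID recorded for the rows it touches.
    for row in write_set:
        seen = cert_info.get(row, set())
        if not seen <= snapshot_gtids:   # row changed by a txn we didn't see
            return False                 # negative certification: abort
    return True

def certifier_loop(ordered_txns, gtid_executed):
    # Every node runs this single-threaded loop over the same totally
    # ordered stream, so every node reaches the same verdict per transaction.
    cert_info, verdicts = {}, []
    for gtid, write_set, snapshot in ordered_txns:
        ok = certify(write_set, snapshot, cert_info)
        verdicts.append((gtid, ok))
        if ok:
            for row in write_set:
                cert_info.setdefault(row, set()).add(gtid)
            gtid_executed.add(gtid)
    return verdicts
```

Because the loop's input order and its state transitions are identical on every member, no coordination is needed to agree on the verdicts.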
Slower networks will be harder to deal with, but latency in particular: very high latency will make performance drop. But this happens to everyone, so that's fine; you just have to adapt to it. Then the certification throughput is also an important factor, because certification is a sequential process: we do it in a single thread that decides what happens to each transaction. It's not heavy, it can handle many transactions, but you have to be careful, because if it takes longer we start delaying the acknowledgments because of that. And then, of course, the remote appliers: once transactions get to the relay log, they depend on the same things replication depends on, which is the throughput of the slave applier and the number of applier threads that can actually perform the remote apply.

Okay. So, let's see what this means. These points here are the main contention points. The first one is not that significant: it's a small step where we gather the write-set information. It takes a bit, we hash it, but it's very light, nothing very significant. But then we have this process where we have to send the message out, reach an agreement, and then wait for certification on that side. That is the main contention point. On the certifier, we have to make sure that certification itself is able to keep up with the pace we have, and we have to make sure that the transactions get to the relay log at a rate that doesn't delay the system. This can also be an issue, and again, so can the applier.

So, we have a few options here that allow us to control these things. Some are not very controllable, but, for instance, if we have high latency, we can put more transactions in flight: we extend the concurrency so that more transactions are in the system at once, and then they eventually arrive.
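The latency-versus-concurrency trade-off here is just Little's law: keeping N transactions in flight, each spending L seconds in the ordering/certification round trip, gives a throughput of roughly N / L. A minimal sketch:

```python
def sustained_tps(in_flight, round_trip_seconds):
    # Little's law sketch: throughput ~= concurrency / latency.
    # High network latency can be compensated by keeping more
    # transactions in flight, at the cost of per-transaction latency.
    return in_flight / round_trip_seconds
```

So with a 50 ms round trip, 100 in-flight transactions sustain about 2,000 TPS, and doubling the concurrency doubles the throughput, as long as nothing else saturates first.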
And we keep a high throughput at the cost of increasing the latency of individual transactions. At the point where we intercept the transaction, it is already prepared, all the logs are there, so this part is very lightweight, and increasing concurrency here doesn't create much contention between threads. So it's fine: if you increase the number of parallel transactions, it will behave rather well.

Then there are things we can actively do. One is: okay, we have a slower link, let's compress. We take advantage of the fact that compression, compared to the network, can run at a much higher rate, so we can mostly use CPU to compensate for the network limitations. We can also reduce the binlog itself and use minimal row images, so that the rows we send are smaller.

And then there is this one, which maybe I shouldn't put here, because it's very low-level tuning for particular situations; we found it to be effective in some cases, so it's here because it may be useful. If you have bursty traffic, where you send something, then wait, then send and wait again, sometimes the thread receiving the messages will go to sleep, and scheduling that thread back in delays the reception. If we can avoid that sleep for just a bit, sometimes we see a big throughput gain, because the thread never sleeps. This depends on the network, so don't worry too much about it, but it may be handy in some situations.

Okay. So, the certifier throughput. We have to certify and, as I said, we have to write the transaction to the relay log. This is done sequentially: we certify and put the transactions there. Actually, we tried a solution where we did this in parallel.
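The compress-before-sending idea is easy to demonstrate: row-based replication events are often highly repetitive, so spending a little CPU shrinks the message dramatically. This is a sketch of the mechanism only; the threshold value here is made up (Group Replication exposes a compression threshold option for this):

```python
import zlib

def maybe_compress(payload, threshold=1024, level=6):
    # Trade CPU for bandwidth: compress the replication message before it
    # goes out, but only if it is large enough to be worth the CPU cost.
    if len(payload) < threshold:
        return payload, False
    return zlib.compress(payload, level), True

# Repetitive row images, typical of row-based binlog events.
row_events = b"INSERT INTO t1 VALUES (1,'aaaa')" * 200
compressed, was_compressed = maybe_compress(row_events)
```

On a link that is the bottleneck, a 10x or better reduction in message size translates almost directly into throughput, which is why this usually pays off on slower networks.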
But the benefit is not very big, because in the end you'll be limited by the throughput of writing to disk. Either you are within that throughput, or you improve just a bit and then you're still limited by it. So take care of where you put your relay logs, and check whether the system is indeed capable of handling the throughput you want. Otherwise you may have nodes whose certification is delayed compared to other nodes, and that can be quite a delay if you don't take care.

There's also another issue that may happen when you write to multiple masters. If we make the nodes consume GTIDs sequentially, one at a time, two different nodes will be competing for the same values, and the gtid_executed sets will become very large because they cannot be compacted. GTID sets compact very well when they are contiguous, but if two nodes are interleaving over the same sequence, they will be building larger and larger GTID sets. And this is important for the certification process. So, if you use multi-master, please do not set this block size to one; otherwise performance will really drop, as the certification info itself will grow more than it actually needs to. The consequence of block allocation, which is not a big problem, I think, but may be in some situations, is that the GTIDs will not be contiguous across nodes: the GTIDs from one node will occupy one interval, the GTIDs from another node will occupy another interval, and when those intervals are exhausted they get new ones. So you get GTIDs in blocks when you write multi-master. But by default this block size is already one million, I think. Just please don't set it to one.

And then we have the applier. The applier side. On the applier side, of course, if we want to use the parallel applier, we have to use the LOGICAL_CLOCK scheduler.
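The compaction effect is easy to see with a sketch. The option being discussed is, I believe, the GTID assignment block size (default one million); the interval-merging logic below is a simplified model of how GTID sets are represented:

```python
def gtid_intervals(seq_numbers):
    # Collapse transaction sequence numbers into the interval form used by
    # GTID sets (e.g. 1-3:7-8). Contiguous runs merge into one interval.
    out = []
    for n in sorted(seq_numbers):
        if out and n == out[-1][1] + 1:
            out[-1] = (out[-1][0], n)   # extend the current interval
        else:
            out.append((n, n))          # start a new interval
    return out

# Block size 1: two writers interleave, so nothing compacts.
writer_a_interleaved = list(range(1, 20, 2))   # 1, 3, 5, ... 19
# Block allocation: each writer owns a contiguous range, so its
# share of the GTID set stays a single interval.
writer_a_block = list(range(1, 11))            # 1..10 from its own block
```

With interleaving, ten transactions produce ten intervals; with block allocation they produce one. Since the certifier carries GTID sets around per write-set entry, that difference compounds quickly.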
And we have to have enough threads to handle the workload. Fine? How many will depend on the system where we are running. If possible, use more rather than fewer, but past a point it just doesn't pay to have more threads. That point may be between 8 and 16, or, if it's really write-intensive, 24 or 32, but then we start having contention on the distribution side as well.

One good thing about the applier in Group Replication is that we take advantage of the information used for certification to improve parallelism on the slave. We have already decided which transactions are compatible between nodes; we know that from certification, so we also know which transactions are compatible between themselves, by the rows they touch. So in Group Replication we use that write-set information to schedule transactions on the slave applier. What this means is that we can ramp up much faster on the slave applier using the write-set information than using binary log group commit. Asynchronous replication mostly groups transactions that are in the same group commit: it says, okay, if they all committed together, it's because they were running in parallel on the server, and it uses that information to run them in parallel on the slave. But if we have very fast storage, very, very fast storage, and only a few threads, the group commit, without inserting artificial delays, will be very small: only a few transactions commit together, and in that case we get little parallelism on the other side, on the slave applier. If we use the write-set information from Group Replication, we don't need the group commit to decide what can run in parallel. So even with only a few threads, we can already get a lot of the throughput of the slave applier.
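A sketch of the write-set dependency idea: each transaction only depends on the last earlier transaction it shares a row with, so transactions with disjoint write sets can run in parallel on the applier even if they were in different group commits on the source. The representation as (sequence, last-dependency) pairs is a simplification:

```python
def writeset_schedule(txns):
    # For each transaction (given as its set of row hashes, in commit
    # order), compute the highest earlier sequence number it conflicts
    # with. Appliers may run a transaction as soon as everything up to
    # its dependency has committed.
    last_writer = {}   # row hash -> sequence number that last wrote it
    schedule = []
    for seq, write_set in enumerate(txns, start=1):
        dep = max((last_writer.get(r, 0) for r in write_set), default=0)
        schedule.append((seq, dep))
        for r in write_set:
            last_writer[r] = seq
    return schedule
```

With group-commit-based scheduling, two transactions from different commit groups can never run in parallel; with write sets, only an actual row overlap creates a dependency, which is why the applier ramps up even with few client threads.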
This is one of the benefits, and it's something that probably reduces lag in situations where we have lag today with a single server.

One other thing: here it's important for us to keep the nodes mostly, and I mean mostly, close together, but not exactly in sync. If we let the writing node go at full speed, in some situations there's no way the others can keep up, and if we have a group where more than one member tries to write, it's very difficult for them to write effectively if they are not close. So we had to introduce flow control for this too. There are also situations where we want to be able to manage the cluster, and one of the important ones is adding a new member to a cluster that's writing fast, with a heavy write workload, and we want it to be able to join. The work of a node joining the cluster is much larger than that of a node just running in it, because it has to store what's coming in from the network and also apply what's already in its queue. So for this we introduced flow control, which is one thing that works a bit differently here. Okay, I'll skip the performance graphs.

On this side, we wanted the writer itself to decide. Instead of having some layer send a message to each writer saying "apply this delay" or something like that, no: each node just sends to the group one message per second saying, okay, this is my certifier queue, this is my applier queue, this is the number of transactions I tried to execute in the last period, this is the number of transactions I successfully applied, and so on. And then everyone listening on the network knows the state of all members.
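The once-per-second stats message can be sketched as a small record; the field names here are illustrative, not the actual wire format:

```python
from dataclasses import dataclass

@dataclass
class MemberStats:
    # Broadcast by each member once per second; every member hears every
    # other member's stats, so throttling decisions can be taken locally.
    member_id: str
    certifier_queue: int        # transactions waiting to be certified
    applier_queue: int          # transactions waiting to be applied
    certified_last_period: int  # certified in the last second
    applied_last_period: int    # applied in the last second

def group_needs_throttling(all_stats, queue_threshold):
    # Throttle only if some member's queue has grown past the threshold.
    return any(s.applier_queue > queue_threshold or
               s.certifier_queue > queue_threshold for s in all_stats)
```

Because the decision is derived from broadcast state rather than negotiated, the only extra traffic flow control adds is this one small message per member per second.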
And with that, whenever a member wants to write, it cannot write more than the system can withstand. So, if we notice that the slowest member in the last period has a queue growing beyond a threshold we set, if that queue is larger than the maximum we decided on, then the writers are throttled: we check how much that member was able to run, and with that we say, okay, it was able to execute 1,000, let's take 10% off to allow it to work through the old transactions, and try to keep it around that rate. The next second we do exactly the same: we always check whether the state has changed, and if everything is clear we ramp back up, and we do this once per second. So, whenever a writer wants to write, it will check the quota that's available to it as a writer, and that's it. You won't see any flow-control messages going around other than this stats message.

Okay, flow control does introduce some changes, but actually disabling flow control does not increase throughput much. Throughput drops a lot only if you set low thresholds. This is designed to have a threshold that's significantly larger than the number of transactions a node can run in a second. So, if a node handles 10,000 transactions per second, the threshold should be larger than that; otherwise flow control will be throttling more than it should.

Okay, so, that's it. Do I need to finish now? If needed I can skip ahead. Okay, so as I said before, this is just an artificial benchmark, I mean the Sysbench benchmark, which I like a lot. It was great for us during development, but let me just show what you can expect in this configuration compared to asynchronous replication. Okay, of course you have a loss compared to asynchronous replication: we start our work exactly where asynchronous replication would just commit.
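The quota computation just described (observed rate of the slowest member, minus a 10% margin so it can also drain its backlog, split among the active writers, recomputed every second) can be sketched as:

```python
def writer_quota(slowest_applied_last_second, n_writers, margin=0.10):
    # Flow-control quota sketch: cap the group's write rate at what the
    # slowest member proved it could apply in the last second, keep a
    # margin (10% here) so it can also work off its backlog, and split
    # the remainder among the active writers. Recomputed every second.
    group_quota = int(slowest_applied_last_second * (1 - margin))
    return group_quota // max(n_writers, 1)
```

So if the slowest member applied 1,000 transactions last second, a single writer gets a quota of 900, and two writers get 450 each, until the next stats message arrives and the quota is recomputed.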
Of course we have higher latency, those are the triangles there, but the thing is, we believe this is reasonable. We are, of course, trying to improve on this, and we have overheads, but it's great for us to have already reached this point, and we want to get closer. And of course, this is only Sysbench; let's see how it behaves with real users and workloads. But at least it's good for us to be able to reach this point.

There's also the issue of multi-master. Of course we do not recommend it right away, but you can use it, and using it there is some question about scaling with the number of writers. There's a bit of scaling here, but there's no big gain from using multiple writers when you have a system that's fast enough. Of course, if you have many reads and only a few writes, then yes, you are exploiting the read capacity of each node, and the read capacity will be enough to give you some gain. But, well, that's it: it's something you need to study, whether it benefits you and whether you can use it. Then we also see the growth from three to nine members, which is also good. There's some effect there at five and seven members, but that's okay: there's no big drop, and there's no big drop because of what Alfredo said. The XCom layer, the GCS layer below, has the capacity to handle this, so even growing to nine members we are still able to handle it well.

And then there's another thing that Kenny was saying, that you should not use WANs. No, please: what we wanted to say was that it's not as optimized for WANs as it could be, and we are working on that, but you can use it on WANs perfectly well. There's no issue there. Of course, if you put one node with some delay into a group of three, you will add delay, but then you compensate for that by moving the line to the right, with more concurrency.
So, you will reach a throughput that's higher, but with more threads, and the same with the 50 milliseconds case, which also grows a bit more that way. And this is the throughput over time, but this is not sustained throughput as before, this is peak throughput; this is the Sysbench result. We see small dips there, which Renee already complained about, so we'll put work there. That's the garbage collection, which sometimes kicks in and slightly lowers the throughput. Okay, we'll also improve on that. Well, and that's it. Right now we are also waiting for feedback, trying to understand how all the workloads behave and whether people can indeed take advantage of this.

Okay, so the thing is, and I can explain this: the pipeline that we use in the communication system has a number of slots, and those slots are configurable, so having a very large latency will decrease your throughput. It will work perfectly, but then you have to know whether you can put enough threads in parallel, whether you can withstand all of that latency. So, fine, but you need to know whether your application can really sustain such a large number of threads. Yeah, no, one transaction, sorry. In terms of execution, no: if you use more transactions, you'll have more parallelism on the server, yes. No, no.

Okay, any more questions? So, you're saying that you are using Paxos for the messages, right? If I recall correctly, Paxos waits for a majority, right? So, in your graph, when there was one node of three that had, let's say, high latency, was it because you were writing to it, or is it that even when you write on only one node, where the majority is fast enough, it slows down or not? So, that's one of the things we're actually working on, but the main issue is that we have to write to the network. For one transaction, what happens is that we reach a majority.
If we have two nodes locally and one remote, for one transaction we execute immediately: there's no added delay there. But the thing is, this works for one transaction. When you add many, then at some point, which is the horizon of the pipeline, it will start delaying. It will also start delaying immediately if you send large transactions. So, if you send small transactions from only a few threads, you should see very low latency, but when you have more, you will see that added latency. Okay? So, yes, with Paxos, because that's what we use, we should be able to return as soon as the two low-latency nodes agree, but right now we can do that only in one case instead of in all situations. That's what we need to fix. Okay?
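The point about majority-based ordering can be made concrete with a small sketch: a message is ordered once a majority of members acknowledge it, so its latency is the majority-th smallest acknowledgement latency, not the slowest member's. Latency values here are illustrative:

```python
def majority_ack_latency(node_latencies_ms):
    # Paxos-style majority sketch: with n members, ordering completes
    # once floor(n/2) + 1 acknowledgements are in, so the per-message
    # latency is the (majority-th) smallest ack latency.
    acks = sorted(node_latencies_ms)
    majority = len(acks) // 2 + 1
    return acks[majority - 1]
```

So with two local nodes at 1-2 ms and one WAN node at 200 ms, a single in-flight transaction is ordered in about 2 ms; the slow node only starts to matter once the pipeline fills up and messages queue behind each other.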