Next up we have our final presentation of the scaling and performance session, titled, I believe it's pronounced, "Kauri: Scalable BFT Consensus with Pipelined Tree-Based Dissemination and Aggregation," by Ray Neiheiser.

My name is Ray Neiheiser, so it was a slightly different pronunciation, but even people in my own country get it wrong. I'm going to present our joint work, Kauri: scalable BFT consensus with pipelined tree-based dissemination and aggregation, which was one of the SOSP papers, as with many of the talks today.

There is a lot of interest in permissioned blockchains, and there is also a decent interest in supporting large numbers of consensus participants; a relatively famous example is Diem, which stated this goal in its white paper, among others. Most of these systems rely on BFT consensus, most commonly on some kind of PBFT derivative or on HotStuff.

PBFT works in a relatively straightforward way: there is a leader that sends a proposal to everyone, everyone then votes on it, that goes on for several steps, and then there is agreement on a given value or set of values. If the leader fails, the leadership simply switches to a different process, and after at most f+1 such changes there is necessarily a correct leader and the system can make progress. As such, PBFT and its derivatives are a good example because they offer high resilience, low latency, and optimal reconfiguration, in the sense that within f+1 steps they are guaranteed to reconfigure correctly. One of the problems of PBFT and its derivatives is that they have a high message complexity, namely quadratic message complexity.

HotStuff is one of the protocols that tries to address that. Instead of having all the members broadcast messages, a central process disseminates and aggregates them. It has twice the number of communication steps per phase, and, similarly to PBFT, if the leader fails the leadership simply switches, and again after f+1 steps we have a correct leader and consensus can be achieved. So it keeps the same high resilience, manages to drop the message complexity a bit, and still achieves optimal reconfiguration, but, as I said, it now has twice the latency.

One of the problems of these solutions is that they are actually inherently non-scalable, because one or more processes have to send, receive, and process all n messages, which leads to a bottleneck in terms of both bandwidth and CPU. There are a number of alternatives in the literature, built on top of these algorithms, that try to solve this: either committee-based solutions or solutions based on dissemination and aggregation trees. Committee-based solutions work in a relatively straightforward way: instead of having all the processes agree on the value, each round we select a subset of processes, that subset agrees on behalf of the others, and then propagates the result. The problem is that there is still a decent chance that in one of these rounds a majority of the committee is faulty, which can cause problems such as, in some cases, low resilience, or, in many cases, only non-deterministic safety.
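To make the bottleneck argument concrete, here is a rough, illustrative sketch (my own back-of-the-envelope numbers, not from the paper): even though HotStuff cuts the total message count per phase from quadratic to linear, the leader still has to send to and receive from every other process, whereas a tree spreads that load over the internal nodes.

```python
def phase_cost(n: int, fanout: int = 10) -> dict:
    """Rough messages per consensus phase: total in the system and at the busiest process."""
    return {
        # PBFT-style all-to-all: every process broadcasts its vote to all others.
        "pbft":     {"total": n * (n - 1), "busiest": 2 * (n - 1)},
        # HotStuff-style star: the leader disseminates the proposal and collects every vote.
        "hotstuff": {"total": 2 * (n - 1), "busiest": 2 * (n - 1)},
        # Tree with the given fanout (example value, my choice): internal nodes relay and
        # aggregate, so the busiest process only exchanges messages with its children.
        "tree":     {"total": 2 * (n - 1), "busiest": 2 * fanout},
    }

for n in (100, 200, 400):
    print(n, phase_cost(n))
```

The point of the sketch is simply that the "busiest" column grows with n for both all-to-all and star topologies, which is the inherent scalability limit mentioned above.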
Tree-based solutions avoid this. They are similar to HotStuff in spirit, but they use a different communication topology in which certain processes relay the messages, and that way we are able to distribute both the communication load and the processing load, while maintaining the same resilience as PBFT and HotStuff and the same overall guarantees. The problem is that these approaches are relatively hard to configure. On top of that, they have a problem of high latency: HotStuff already has twice the number of communication steps per phase, and a tree, depending on its depth, has four, six, eight, or ten times more, which leads to problems in terms of throughput as well. And in terms of reconfiguration, just applying the same approach of trying a different leader won't do, because depending on the number of internal nodes and how the faults are distributed in the system we might never find a correct tree this way; in fact, there is a factorial number of different trees, of which only a small percentage is correct, so it is a really hard problem to solve. And in terms of latency, while the messages propagate through the system, the majority of the time the majority of the processes are just idle, waiting for the next message or waiting for answers to process, which is a big problem because the actual resource utilization is very low, so the throughput is much lower than it could be.

That is what we try to solve with Kauri. Kauri is a tree-based approach, so we also use dissemination and aggregation trees, but we try to solve several of those problems. These are the main challenges: we have optimal reconfiguration for low numbers of failures, we compensate the extra latency through a pipelining scheme, and we still offer high resilience.

Let's first look at how we do the reconfiguration. Assume we have a generic tree like the one here, with a fan-out of five, and that our number of failures is smaller than the fan-out. How will we be able to reconfigure the tree such that we construct a robust tree in an optimal number of steps? Given the fan-out, we have to construct more bins than we have failures. Each bin contains as many processes as the tree has internal nodes; in this case we have six internal nodes. So we build f+1 bins of six or more nodes each, and then it is relatively straightforward: if we replace the internal nodes with the nodes of a different bin at each reconfiguration, eventually we will pick a bin without any faulty nodes, which has to exist since there are more bins than faulty nodes, and we will be able to build a robust tree in an optimal number of steps.

For latency, we take a similar approach to what HotStuff does for its problem, but we extend it further. As the leader sends the first message out, the leader already knows the hash of the previous block, so the leader can also construct the second block, and the third block, and in the meanwhile pipeline several blocks optimistically through the system. The question is how many messages we can actually pipeline, and so we have to look into how to configure this: if we choose values that are too small, we will underutilize the resources and have a lower throughput than necessary, and if we choose values that are too large, that will lead to congestion and the latency for the client will be much higher than necessary. So we need a performance model to configure Kauri properly.
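Before moving on to the performance model, here is a minimal sketch of the bin-based reconfiguration just described (a simplified illustration of the idea, with my own names and structure, not the prototype's code): partition the processes into f+1 disjoint bins of internal-node candidates and swap in the next bin at each reconfiguration, so within f+1 attempts some bin contains no faulty process and the tree is robust.

```python
from typing import List, Set

def make_bins(processes: List[int], internal_per_tree: int, f: int) -> List[List[int]]:
    """Split the processes into f+1 disjoint bins of internal-node candidates."""
    assert len(processes) >= (f + 1) * internal_per_tree, "not enough processes for f+1 bins"
    return [processes[i * internal_per_tree:(i + 1) * internal_per_tree] for i in range(f + 1)]

def find_robust_tree(processes: List[int], internal_per_tree: int,
                     f: int, faulty: Set[int]) -> List[int]:
    """Swap in one bin per reconfiguration; at most f+1 reconfigurations are needed."""
    for attempt, candidate in enumerate(make_bins(processes, internal_per_tree, f), start=1):
        if not any(p in faulty for p in candidate):
            print(f"robust tree after {attempt} reconfiguration(s): internal nodes {candidate}")
            return candidate
    raise RuntimeError("unreachable when |faulty| <= f: some bin must be fault-free")

# Example: a fan-out-5 tree over 31 processes has 6 internal nodes (root plus 5 children);
# with f = 3 faults (fewer than the fan-out) we build 4 bins and try them in turn.
find_robust_tree(list(range(31)), internal_per_tree=6, f=3, faulty={2, 7, 13})
```

In this example the first three bins each happen to contain a faulty process, so the fourth reconfiguration yields the fault-free set of internal nodes, which is still within the f+1 bound.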
So, there is basically a total time that each round takes, and that time is more or less dominated by the hop latency, that is, the time propagation takes, plus the computation at each step. Of that total time, there is a portion we call the idle time, which is basically the total time minus the time the process at the root needs to send the messages, minus the time it needs to process the messages. Once we have the total time, the idle time, and the processing time, we can calculate the pipelining stretch, which is the number of additional blocks we are going to process in the system. If you want to hear more of these details, they are in the paper that is going to be published at SOSP, and if you want it beforehand you can send me a message on Slack and I can give you a preprint.

We evaluated this on Grid'5000 with up to 20 physical machines and executed a number of different experiments; I will quickly describe a small subset containing the most significant ones. This particular experiment was executed with three different validator set sizes, namely 100, 200, and 400, in a setting with 100 milliseconds round-trip time and 100 Mbit links, which is, for example, an intra-Europe or intra-US setting. Kauri is shown by the two blue lines, one with and one without pipelining, and HotStuff is shown by the orange and red lines. We configured the pipelining stretch of Kauri according to the previously presented theoretical model, resulting in a stretch between four and six. On the x-axis we have the number of processes and on the y-axis the throughput. We can already see that there is a pretty big difference between Kauri and the star-based approaches: Kauri has a throughput advantage of up to 26 times over HotStuff, and that is due to the inherent scalability issues I mentioned earlier with large validator sets. It also shows that the non-pipelined version of Kauri already performs better than HotStuff above 200 processes, and at 400 processes it already has double the performance, even though the higher latency should be a relatively large problem.

In the second experiment, we used a different setup with 200 milliseconds round-trip latency and 25 Mbit links, which is a geo-distributed blockchain setting that can also be found, for example, in the Algorand paper. For this experiment we varied the block size between 32 KB and 1 MB and then checked the maximum throughput we could achieve; here we have the throughput on the x-axis and the latency on the y-axis. HotStuff behaves in a relatively straightforward way: as we increase the block size, the latency increases because the system starts bottlenecking, and the throughput increases as well, until it reaches the maximum the system is able to handle. Kauri's results are a little bit different. First of all, with very large block sizes we cannot pipeline as well, because we can only pipeline zero to one additional blocks, and that is why there is a small reduction in throughput at the top part of the graph.
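Returning for a moment to the performance model: as a hedged reconstruction of how the pipelining stretch could be computed from it (the variable names and the exact formula layout are my simplification, not the paper's notation), a minimal sketch looks like this.

```python
def pipelining_stretch(hop_latency: float, hop_compute: float, depth: int,
                       root_send_time: float, root_process_time: float) -> int:
    """Roughly how many additional blocks the root can pipeline per round."""
    # One round: the proposal flows down the tree and the aggregated votes flow
    # back up, paying latency plus computation at every level, in both directions.
    total_time = 2 * depth * (hop_latency + hop_compute)
    # Idle time at the root: everything the root is not spending on sending the
    # proposal to its children or on processing the aggregated replies.
    idle_time = total_time - root_send_time - root_process_time
    # Each additional block occupies the root for roughly send + process time.
    busy_per_block = root_send_time + root_process_time
    return max(0, int(idle_time // busy_per_block))

# Made-up example values in seconds, giving a stretch in the 4-6 range that was
# used to configure Kauri in the first experiment.
print(pipelining_stretch(hop_latency=0.05, hop_compute=0.005, depth=2,
                         root_send_time=0.02, root_process_time=0.02))
```

The sketch also hints at the block-size effect mentioned above: with very large blocks the root's send and process times grow, the idle window shrinks, and the stretch drops toward zero or one additional blocks.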
Similarly, because we only had 20 physical machines available, at very low block sizes we were computing so many blocks in the system, up to 25 concurrent blocks, that the performance degraded a little: we had a lower throughput than we otherwise could, and a slightly higher latency, because of the congestion. The key takeaway of this graph is that the increase in latency is much steeper in HotStuff than in Kauri, so that in certain scenarios, even though the tree adds extra communication steps, Kauri still performs better than HotStuff in terms of latency.

We also did a small test where we evaluated the impact of failures on the system. We measured this in the same setting as the previous experiment, with 200 milliseconds round-trip latency and 25 Mbit links, and configured the system such that the current leader fails after 40 seconds and, consecutively, after being elected, the next two leaders fail in the same manner. Even with this admittedly low number of failures, we see that Kauri not only reconfigures in a similarly small time period as HotStuff, but is also able to scale the pipelining back up to attain similar throughput values as before.

Concluding, Kauri is able to easily scale to hundreds of processes in comparison to the state of the art; we outperform previous work by a factor of up to 28 in one of these scenarios; we have no resilience trade-offs; and we achieve optimal reconfiguration for small values of f, which is arguably the most common case. There are many more experiments in the actual paper, and we also have a prototype available on GitHub if anyone is interested. Thanks, everyone, for listening, and I'll be happy to answer any questions.