Well, hi everyone. Welcome to today's Protocol Labs Research Seminar. Today we're joined by Chrysa Stathakopoulou, who recently joined the team at Chainlink Labs as a researcher, where she's helping smart contracts reach their full potential. Chrysa was a doctoral student at IBM Research and ETH Zurich and also earned her master's at ETH Zurich. Today she will be talking about state machine replication scalability made simple. So thank you so much, Chrysa, for taking the time to join us today, and I'll let you take it from here.

Hello everyone, and thank you for having me here today. Today I will talk to you about a paper that was recently published at EuroSys. It is work we did with people you probably know, Matej and Marko, while we were all working together at IBM Research, and the title says State Machine Replication Scalability Made Simple. The talk today will first briefly cover the problem we try to solve, state machine replication, and then I will try to justify the title of the presentation; namely, I will try to explain why scalability is not simple. Then I will present our solution, which we call ISS. It stands for Insanely Scalable State machine replication, and it is a protocol that multiplexes single-leader protocols that solve total order broadcast into a scalable solution that again solves total order broadcast. While such multi-leader total order broadcast protocols may already exist, we want to emphasize that our solution is both efficient and modular. To give a spoiler, the result we achieved with this protocol is that when we implemented it with known protocols such as PBFT, HotStuff, and Raft, we managed to make them scale to 128 nodes with a performance of more than 50,000 requests per second on a wide area network with 1 Gbps connections. I would also like to highlight that I will not be talking today about a sharding protocol, so this performance is achieved while maintaining a common total order among all instances.

But before jumping into the details, let's take a step back and look at the problem we try to solve in more detail. The problem, as is clear by now, is state machine replication, but it all starts with a simple state machine. Let's say that we have a client that wants to submit a request to some remote service, and such a request has a payload and an identifier. The payload is the operation that will be executed by the service to go from an initial state to some new state. Once this is done, the service responds to the client, notifying them that the operation was done and assigning a sequence number to the response. This is a very simple solution; however, it has an important drawback: it has a single point of failure. So what people went on and did was to build a replicated state machine, where we don't have only one server but multiple servers. As before, the client sends requests and receives responses. However, we want to make sure that we tolerate situations where one or more of those servers not only crash but may behave arbitrarily, and this is what we refer to as Byzantine fault tolerance. What we actually want to achieve is to make this replicated service act as one server, so that it is transparent to the client, and the client receives the same response whether it sends its request to one server or to multiple servers.
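To make the simple, non-replicated state machine just described concrete before we turn to the properties of the replicated version, here is a minimal Go sketch: the service applies each request's payload to its state and acknowledges it with a sequence number. It is purely illustrative; the type and function names (Request, Response, StateMachine, Server.Submit) are assumptions of this sketch and are not taken from the ISS code base.

```go
package main

import "fmt"

// Request carries a client identifier and an opaque payload (the operation to execute).
type Request struct {
	ClientID string
	Payload  []byte
}

// Response notifies the client of the sequence number assigned to its request.
type Response struct {
	SeqNo  uint64
	Result string
}

// StateMachine applies one operation and returns a result.
type StateMachine interface {
	Apply(payload []byte) string
}

// Server wraps a state machine and hands out consecutive sequence numbers.
type Server struct {
	sm   StateMachine
	next uint64
}

// Submit executes a single client request and acknowledges it with a sequence number.
func (s *Server) Submit(req Request) Response {
	seq := s.next
	s.next++
	return Response{SeqNo: seq, Result: s.sm.Apply(req.Payload)}
}

// appendLog is a toy state machine that simply records the payloads it has applied.
type appendLog struct{ ops [][]byte }

func (a *appendLog) Apply(p []byte) string {
	a.ops = append(a.ops, p)
	return fmt.Sprintf("applied operation %d", len(a.ops)-1)
}

func main() {
	srv := &Server{sm: &appendLog{}}
	resp := srv.Submit(Request{ClientID: "client-1", Payload: []byte("set x=1")})
	fmt.Printf("request got sequence number %d: %s\n", resp.SeqNo, resp.Result)
}
```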
This replicated behavior is captured by a set of properties: the agreement property, which makes sure that all servers assign the same sequence number to a request; the integrity property, which makes sure that when a sequence number is assigned to some request, this request was indeed sent by the client; the liveness property, which makes sure that the client will for sure get back a response with the sequence number; and the totality property, which makes sure that the client will eventually get the same response from all the remote servers, if they are correct. These properties describe a problem that is known in research as total order broadcast, where the terminology we use is that the client broadcasts a request to the service, and the service delivers the request with a sequence number to the client. When the nodes now build a log of such requests from potentially many clients, with sequence numbers assigned to all the requests, we see that we can create an ordered log of these requests based on the sequence numbers. This construction might seem very familiar, because it basically describes a blockchain. And because of the applicability to blockchain systems, it is particularly interesting to solve total order broadcast under specific assumptions, namely partial synchrony, which describes the network communication. This means that we usually assume that the network is asynchronous, so messages might be delayed arbitrarily, but from time to time we assume that the latencies are predictable, or that the network is synchronous. And it is interesting to solve such a problem in a wide area network.

State machine replication is not a new problem; it has been around for many decades. However, its applicability to blockchain systems has posed new requirements. As we said, we want to solve the problem in wide area networks with many nodes, where we require high throughput and low latency, even when we operate with faults, and even when we operate with Byzantine faults. The protocols that were suggested in research in the beginning all had a common denominator: they are single-leader total order broadcast protocols. And the problem with single-leader total order broadcast protocols is that they have a very important bottleneck. To explore this bottleneck, imagine that we have some clients here submitting requests to some nodes that solve the problem, and assume that the requests are grouped into a batch. Node 0 here is the leader of the protocol, and what this node has to do is to disseminate this batch to all nodes. We usually call this phase the proposal or pre-prepare phase in such protocols. Then the protocol may proceed in multiple rounds, depending on its actual implementation. But the problem is already in the first phase, because this batch that the leader has to disseminate is very big, and what is worse, the more nodes we have in the system, the more data the leader needs to push through the network. Since the leader has limited bandwidth available, it becomes a bottleneck, which makes the throughput fall very steeply. What we actually observe is that the throughput drops inversely proportionally to the number of nodes. To resolve this single-leader bottleneck, researchers suggested parallel total order broadcast protocols, where more nodes act as leaders, so that more nodes try to send requests through the network at the same time.
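To make the single-leader bottleneck concrete, here is a rough back-of-the-envelope sketch in Go. It ignores protocol overheads, acknowledgment traffic, and later protocol rounds, and only captures the fact that the leader alone must push every proposed request to all other nodes, so its outgoing bandwidth caps throughput roughly in proportion to 1/n. The function name and the numbers are illustrative assumptions, not measurements from the paper.

```go
package main

import "fmt"

// maxLeaderThroughput estimates how many requests per second a single leader can
// sustain when it alone must disseminate every request to the other n-1 nodes.
func maxLeaderThroughput(leaderBandwidthBps float64, n int, requestSizeBytes int) float64 {
	bytesSentByLeaderPerRequest := float64((n - 1) * requestSizeBytes)
	return (leaderBandwidthBps / 8) / bytesSentByLeaderPerRequest
}

func main() {
	const gbps = 1e9    // a 1 Gbit/s link, as in the evaluation setup
	const reqSize = 500 // roughly Bitcoin-transaction-sized requests
	for _, n := range []int{4, 16, 64, 128} {
		fmt.Printf("n=%3d  ~%.0f req/s\n", n, maxLeaderThroughput(gbps, n, reqSize))
	}
}
```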
However, this multi-leader approach still has an important drawback, and this is duplicate requests. Let's examine an example to see what this means. As before, we have a client that wants to broadcast some request. For liveness purposes, this client cannot depend on sending this request to just one node that will act as its leader, because this node might crash, or it might even be Byzantine and censor the request. So if node 0 and node 1 both act as leaders in parallel, what can happen is that they both put request R0 in their respective batches, and therefore they both waste resources trying to push this request through the network. When clients send their requests to multiple nodes, this has an effect on throughput similar to the effect that single-leader protocols had: namely, we will again have a throughput that is inversely proportional to the number of nodes. What is important to observe in this situation is that we cannot know whether the client sent the request to multiple leaders because of a liveness issue, or because the client was slow and did not receive a response in time, for example from node 0, or whether the client is malicious and wants to exploit this vulnerability of multi-leader protocols to deteriorate the throughput of the protocol.

To address this problem, researchers again came along and proposed protocols such as Mir-BFT and FnF. Mir-BFT is previous work, again done with Marko and Matej, that addresses this problem. To address the duplication issue, what is done in Mir-BFT is a careful partitioning of requests among the parallel leaders, such that each request can be proposed by only one leader at a time, and this assignment is then rotated periodically to still maintain liveness. While this solution indeed demonstrates much better scalability than protocols without such a duplication prevention mechanism, it still has some issues. In particular, such solutions are designed to parallelize specific protocols: Mir-BFT is designed to parallelize PBFT, and FnF is an effort to parallelize a protocol that is similar to HotStuff. Moreover, they depend heavily on a single node, known as the epoch primary, which moves the protocol from configuration to configuration when something goes wrong, and this epoch primary can severely impact the performance of the protocol when it is malicious.

So what we wanted to achieve with our new protocol, ISS, was a parallel-leader protocol to tackle performance, with duplication prevention to maintain good performance at scale. But we also wanted it to be simple and modular. With our simple design, our target was to be able to multiplex not only a specific protocol such as PBFT, but any leader-based total order broadcast or consensus protocol. And we also wanted to do that without depending on an epoch primary, to avoid difficult situations when this primary is faulty. Here, in a nutshell, is how ISS works. What we want to do is basically simple: we want to create a totally ordered log of requests, and the log is represented with sequence numbers. When the client broadcasts a request, some leader node assigns a sequence number to this request. We say that the request is added to the log when it is committed by the underlying protocol instance.
Once all positions prior to the sequence number of a particular request are filled, we can deliver the request, and at this point the request can be executed, because we know all the requests that are to be executed before it in the ordered log. As simple as that: we want to fill a log with requests in a totally ordered way. What we do in order to handle this with a multi-leader protocol is to partition the log into what we call segments. Segments are shown here as the rectangles around different sequence numbers of the log, because what a segment represents is basically just a set of sequence numbers. Each segment is assigned to a particular node that acts as the leader for this segment, and at the same time this node acts as a follower for all other segments. For each segment, we run an instance of what we call sequenced broadcast (SB). Sequenced broadcast is a novel primitive that we came up with for the purpose of this work. It is similar to total order broadcast, yet it terminates for a set of sequence numbers, so it terminates for the sequence numbers in the segment. And importantly, it is implementable with consensus or total order broadcast, so that we can reuse already known protocols from the literature, put them together with ISS, and have a high-throughput solution.

As I mentioned earlier, an issue that is important for multi-leader protocols is request duplication. To avoid this, we adapted the deduplication mechanism that we designed for Mir-BFT, such that we map to each segment of the log a different subspace of the hash space of some hash function, which we refer to as a bucket. Each bucket is assigned to a segment, and the node that acts as the leader for this particular segment is the only node that can propose requests that, when hashed, map to this bucket. So, for example, a client sends request R0 to multiple nodes, and R0 hashes to the blue part of the request hash space and therefore falls into the blue bucket. Regardless of which node received this request, only the node that is currently responsible for the blue bucket, here node 0, is able to put this request in a batch and propose this batch through sequenced broadcast to totally order it.

Now, we are not done, because this protocol is efficient, but it is not live. This is because things can go wrong: the nodes that are responsible for some segment can crash, or be Byzantine and censor requests. So we don't want to rely on a single node for handling the requests that fall into a particular bucket, and therefore we further partition the log into epochs, such that we can reassign buckets to nodes for different epochs. For example, here, after sequence number 11, we start a new epoch, which is again partitioned into segments, but now it is node 1, not node 0, that is responsible for the blue bucket. So if node 0 had crashed or had censored the request, node 1 will be able to propose the requests from the blue bucket. Now let's examine in a bit more detail how we handle faults in ISS. Imagine again our example where node 0 crashed, and at this point node 0 has only managed to assign a request to sequence number 0. What will happen is that we will temporarily assign some other node as the leader of the blue segment here, to make sure that this segment terminates, that is, to make sure that some value is assigned to all sequence numbers of this segment.
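Here is a simplified Go sketch of the two partitioning ideas just described: sequence numbers are partitioned into segments assigned to leaders, and the request hash space is partitioned into buckets whose ownership rotates across epochs so that only one node may propose a given request at a time. It is an illustration under simplifying assumptions of my own (for instance, it collapses the bucket-to-segment-to-leader indirection into a direct bucket-to-leader rotation); none of the names come from the actual ISS implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const (
	epochLength = 12 // sequence numbers per epoch, matching the running example
	numBuckets  = 4  // here, one bucket per leader
)

// segmentLeaderOf maps a sequence number to the leader of its segment: within an
// epoch, the epoch's leaders take sequence numbers in round-robin fashion.
func segmentLeaderOf(seqNo uint64, leaders []int) int {
	return leaders[int(seqNo%epochLength)%len(leaders)]
}

// bucketOf hashes a request payload into one of the buckets that partition the
// request hash space.
func bucketOf(payload []byte) int {
	h := sha256.Sum256(payload)
	return int(binary.BigEndian.Uint64(h[:8]) % numBuckets)
}

// bucketLeaderOf rotates bucket ownership across epochs, so a crashed or
// censoring node cannot hold on to a bucket forever.
func bucketLeaderOf(bucket int, epoch uint64, leaders []int) int {
	return leaders[(bucket+int(epoch))%len(leaders)]
}

// mayPropose says whether node `me` is allowed to put this request into one of
// its own batches in the given epoch; this is what prevents duplicate proposals.
func mayPropose(me int, payload []byte, epoch uint64, leaders []int) bool {
	return bucketLeaderOf(bucketOf(payload), epoch, leaders) == me
}

func main() {
	leaders := []int{0, 1, 2, 3}
	req := []byte("transfer 5 coins from A to B")
	fmt.Printf("sequence number 7 belongs to the segment led by node %d\n",
		segmentLeaderOf(7, leaders))
	for epoch := uint64(0); epoch < 3; epoch++ {
		fmt.Printf("epoch %d: request falls in bucket %d, led by node %d (node 0 may propose: %v)\n",
			epoch, bucketOf(req), bucketLeaderOf(bucketOf(req), epoch, leaders),
			mayPropose(0, req, epoch, leaders))
	}
}
```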
In particular, to let the segment of the crashed leader terminate, we enforce a rule that says that the new leader can assign only a special bottom value to the remaining sequence numbers of the crashed segment. The reason we did that was to allow this segment to finish as soon as possible, and not delay further the progress of other segments that might have already finished, so that we can move to the next epoch as soon as possible. In the next epoch, since we realized that node 0 was not able to finish its segment, we can exclude node 0 from the leader set and distribute all the segments among the other three nodes. For example, in the next epoch, node 1 is elected as the leader for both the yellow and the blue segment. And while this might seem a little imbalanced in this picture, keep in mind that in practice we have many more segments than nodes, so that we can better load-balance the segments of the crashed node once it is gone.

Now let me demonstrate in a bit more detail how we use ISS to multiplex our SB instances. ISS as a black box seems very simple; it has a simple interface: as we said, the client broadcasts requests, and ISS delivers requests with sequence numbers. What ISS keeps inside is actually an ordered log of batches, and we decided to use batches to amortize computation and communication costs. When a request is received, it is mapped to some bucket. In each epoch we spawn multiple SB instances, and the SB instances are internally implemented with a leader-based protocol such as PBFT, HotStuff, or Raft; those were the protocols that we actually implemented to evaluate ISS, but it could in principle be any leader-based protocol. Once a request batch can be formed, which depends on a timeout or on a maximum batch size, we invoke broadcast on the sequenced broadcast primitive, and this happens for multiple SB instances in parallel. Once the batch is delivered by sequenced broadcast, we can put this batch into the log. And as we mentioned previously, once all the positions before some sequence number are filled, the batch at that sequence number can be delivered, and at this point all the requests in this batch are delivered. The total order of a request is then given by the total number of requests delivered up to this point, which we know because we have delivered some batch for all the sequence numbers up to this point, plus the relative position of the request within its batch.

We evaluated our ISS protocol with a Go implementation. We implemented, as I mentioned, PBFT, the chained version of HotStuff, and Raft as the underlying sequenced broadcast. We evaluated ISS on a wide area network that spanned 16 data centers all over the world, where we had up to 128 servers uniformly distributed. Each node was a fairly powerful machine with 32 vCPUs and 32 GB of memory and a 1 Gbps connection; however, 1 Gbps is not an insane bandwidth. We also had a varying number of client machines, from 1 to 16, based on the load we wanted to send to the servers. We evaluated the protocol with 500 bytes per request, because this request size is similar to a Bitcoin transaction. The main operation done by each server upon receiving a request was a signature verification for client access control. And as I mentioned, the requests were ordered in batches, but the batch size differs according to the best configuration for each of the underlying implementations.
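Before moving to the results, here is a minimal, illustrative Go sketch of the delivery rule described a moment ago: batches committed by the parallel SB instances are placed into the log at their sequence numbers, but their requests are delivered only once every earlier position has been filled, which also yields the global order of requests. The types and names are assumptions of this sketch, not taken from the evaluated implementation.

```go
package main

import "fmt"

// Batch is a committed batch of client requests (possibly empty, e.g. a bottom value).
type Batch struct {
	Requests []string
}

// Log collects committed batches and releases them strictly in sequence-number order.
type Log struct {
	committed     map[uint64]Batch
	nextToDeliver uint64
	delivered     uint64 // total requests delivered so far, used for the global order
}

func NewLog() *Log { return &Log{committed: map[uint64]Batch{}} }

// Commit records that an SB instance committed batch b at sequence number sn,
// then delivers as long a contiguous prefix of the log as possible.
func (l *Log) Commit(sn uint64, b Batch) {
	l.committed[sn] = b
	for {
		next, ok := l.committed[l.nextToDeliver]
		if !ok {
			return // a gap remains; wait for the missing sequence number
		}
		for i, req := range next.Requests {
			// Global order = requests delivered before this batch + position within the batch.
			fmt.Printf("deliver #%d: %s (batch %d, pos %d)\n",
				l.delivered+uint64(i), req, l.nextToDeliver, i)
		}
		l.delivered += uint64(len(next.Requests))
		delete(l.committed, l.nextToDeliver)
		l.nextToDeliver++
	}
}

func main() {
	log := NewLog()
	log.Commit(1, Batch{Requests: []string{"r2"}})       // arrives early, held back
	log.Commit(0, Batch{Requests: []string{"r0", "r1"}}) // fills the gap; everything is delivered
}
```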
The batch size also varies with the number of nodes in the network, to allow the protocol to scale better. In this figure, I will discuss how ISS makes the underlying single-leader protocols scale better. On the x-axis we show the number of nodes, and on the y-axis the throughput in thousands of requests per second. The data points on this plot are the saturation throughput: basically, for each configuration in terms of number of nodes, the clients progressively send more requests per second to the nodes until we observe that the throughput does not increase anymore and the latency increases rapidly. At this point we deduce that the protocol is saturated, and the highest throughput reached is the throughput reported on this plot. This line is the evaluation of PBFT without ISS, and it indeed behaves as we expected: the throughput drops inversely proportionally to the number of nodes, whereas the ISS-PBFT implementation reports significantly higher throughput. We see a similar situation for HotStuff. Note that HotStuff even performs worse than PBFT for a small number of nodes; this is related to the way that HotStuff chains proposals, which makes it a latency-bound protocol. As soon as we add more nodes as leaders and have more HotStuff instances running at the same time, we can see that HotStuff's performance improves significantly. Finally, Raft also has a performance that vastly deteriorates with the increasing number of nodes when it is a single-leader implementation. It is important to point out that Raft is a protocol that, as we observe, does not behave very well in a wide area network, because it is again very sensitive to latency, but in a different way. Raft basically has a heartbeat mechanism, and when the leader does not see a response from a follower, it resends the proposal to this follower. So if the latencies in the network are a bit unpredictable, Raft will send the proposals to the followers again and again. To avoid this, we tried to impose higher timeouts between proposals, but still the single-leader implementation would not scale very well because of the single-leader bottleneck. The multi-leader implementation, however, we see performs quite well.

Now let's also examine the latency of the ISS counterparts. Oops, here I'm missing the legend, I'm sorry. The blue line here represents PBFT on 128 nodes. On the x-axis we see the throughput as the rate from the clients increases, and on the y-axis we see the latency. Before the saturation point, we observe that PBFT has a latency of a few seconds, and similarly Raft, which is the orange line, has a latency of a few seconds. HotStuff, we can observe, has a significantly higher latency, and this, as I discussed earlier, is related to the fact that HotStuff is a latency-bound protocol, which affects it significantly in wide area networks. However, in any case, we achieve thousands of requests per second with ISS, with below 10 seconds of latency under an average load.

Next, I would like to discuss a little bit how ISS behaves when we have crash faults. Let me explain these pictures a little. On the y-axis we see the average throughput over short periods of time, and on the x-axis we see time. The periods between dashed lines represent the epochs of the protocol, and the red arrow shows how throughput drops when we have a leader crash. This is an evaluation of ISS-PBFT with crash faults.
In the left picture, we see what happens when the leader of one SB instance crashes at the end of an epoch, and on the right side we see the opposite edge case: what happens when a node crashes at the beginning of an epoch. What we observe is that the crash at the end of the epoch, on the left-hand side, is a worst-case scenario, because when this happens we have to wait to resolve this leader crash before moving to the next epoch, at which point we see a very high peak in throughput, because we order the requests for which the crashed leader was responsible, and then we move on to the next epoch. And because, if you recall, we have blacklisted the crashed leader in all subsequent epochs, we do not observe any further throughput deterioration because of it. On the right-hand side, however, we observe a more friendly scenario, where the leader crashed towards the beginning of the epoch. We can observe that in this case we do not have a significant drop in throughput, because we basically have no idle period: while we were resolving the leader change for the PBFT instance whose leader crashed, the other instances progressed normally. So at the end of the epoch we do not have to wait anymore, and we can already move to the next epoch. And, as explained previously, we have removed the crashed leader from the leader set, so we do not encounter any further leader changes.

This concludes my description of the protocol. You can find our paper online for more details, as well as the source code we used for the evaluation. And you might already be familiar with the name ISS, because a version of it is being developed by Protocol Labs. So that is it for now. All right. Well, thank you so much, Chrysa.