Hi, everyone. Welcome to the Protocol Labs Research Seminar Series. Today we're hosting Rati and Alexander. They're about to give a talk based on their paper titled Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing. Rati and Alexander are blockchain researchers and founding members of Aptos, and I'll let you take it from here.

Thank you for having me and for the introduction, and let me jump into it. Today we'll present Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing, and how it lets us execute 160,000 transactions per second on the Aptos blockchain. This is joint work with Rati Gelashvili, Zhuolun Xiang, George Danezis, Zekun Li, Yu Xia, Runtian Zhou, and Dahlia Malkhi. For this audience I don't need to motivate too much why scaling blockchain execution is one of the most important problems right now. But roughly speaking, every blockchain consists of three main components. The first is consensus: validators need to agree on the order of the blocks. This problem has been studied a lot recently, with developments such as Mir-BFT and Narwhal and Tusk, and we know that consensus can order more than 100,000 transactions per second. However, end-to-end blockchain performance will only be as good as the weakest component, and currently there is nothing comparable for execution. This is what we try to tackle in this work. In the execution layer, validators, after they have already agreed on the sequence of blocks, take these blocks one by one, execute the transactions inside them, and then apply the final state to their storage to persist the results. From an engineering point of view, existing L1 blockchains take one of two approaches. One is simply to execute everything sequentially.
This is of course not scalable; it will very soon become, if it is not already, a bottleneck in any blockchain system. The other approach is to use parallelism. However, most of these systems require users to supply extra information alongside their transactions, such as declaring dependencies or giving hints about the write sets. This is not ergonomic from the user's point of view, and it is also very limiting. This motivates our goal in this work: to design a parallel execution engine that, on the one hand, is transparent to the users. From the user's point of view, everything might as well be executed sequentially; the user doesn't need to do anything. On the other hand, we want an engine that is adaptive to the workload, meaning that if the workload is highly parallelizable — there are very few conflicts — then we can extract the inherent parallelism and achieve very high throughput. But no less importantly, if we get a workload that is very sequential, with a lot of conflicts between the transactions, then we want to introduce as little synchronization overhead as possible compared to sequential execution of the workload. Of course, there is some academic prior work on this problem. The first approach I want to present is called miner replay, which was introduced a few years ago. In this approach we have the miner; it can be a miner in a permissionless setting or a leader in a permissioned setting, it doesn't matter. The idea is that there is one entity which executes the transactions first. It can do it sequentially, or with some other black-box solution. After executing, it can extract the DAG of all dependencies between the transactions.
It can then send this DAG to the other validators, who in turn can use it to come up with a perfect fork-join schedule that extracts the perfect parallelism. This approach has some limitations. First, the validators can execute very fast, but what about the miner? The miner is still slow, since it needs to execute the transactions first, so the overall latency is high. And no less importantly, there is an issue of trust: if the miner is Byzantine, why would the validators trust it? The next approach is Bohm, which actually comes from the database context. Bohm proposes to execute transactions according to a fixed preset order: the participants a priori agree on the order of transactions, and the final state of the execution must be equivalent to executing the transactions according to this predefined, agreed, fixed order. In this work they assume that they know estimations of all the write locations, obtained for example by some static analysis or pre-execution. If they know where each transaction is going to write, they statically prepare data structures before the execution: for every possible memory location, they prepare a slot for every transaction that is estimated to write there. So when another transaction wants to read from this memory location, it knows exactly which slot it should look for. If, for example, the value seven is already there, the transaction can continue. If the value is not yet there but is estimated to be, the transaction has to wait.
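To make Bohm's mechanism concrete, here is a minimal single-location sketch in Rust — my own reconstruction, not code from the talk or the Bohm paper. Slots are reserved up front from the (assumed perfect) write estimations; a reader looks for the highest-indexed writer below its own index, and a reserved-but-unwritten slot means the reader must wait.

```rust
use std::collections::BTreeMap;

// Bohm-style static multi-version store for one memory location:
// one slot per estimated writer, reserved before execution starts.
#[derive(Clone, PartialEq, Debug)]
enum Slot {
    Reserved,     // estimated write, value not produced yet
    Written(i64), // value is available
}

struct Location {
    slots: BTreeMap<usize, Slot>, // writer txn index -> slot
}

impl Location {
    fn reserve(estimated_writers: &[usize]) -> Self {
        Location {
            slots: estimated_writers.iter().map(|&i| (i, Slot::Reserved)).collect(),
        }
    }

    fn write(&mut self, txn: usize, v: i64) {
        self.slots.insert(txn, Slot::Written(v));
    }

    // Read by txn `r`: find the highest writer strictly below r.
    // None means "estimated but not written yet": the reader must wait.
    fn read(&self, r: usize) -> Option<i64> {
        match self.slots.range(..r).next_back() {
            Some((_, Slot::Written(v))) => Some(*v),
            Some((_, Slot::Reserved)) => None, // wait for the estimated write
            None => Some(0), // no prior writer: read pre-block storage (0 here)
        }
    }
}

fn main() {
    // Transactions 3 and 6 are estimated to write this location.
    let mut loc = Location::reserve(&[3, 6]);
    assert_eq!(loc.read(5), None); // txn 5 must wait for txn 3's write
    loc.write(3, 7);
    assert_eq!(loc.read(5), Some(7)); // now txn 5 reads the value seven
    assert_eq!(loc.read(2), Some(0)); // no writer below 2: storage value
}
```

This also shows why an unestimated write is fatal for Bohm: there is no slot to insert into, so the structure has to be rebuilt from scratch, which is the restart the talk mentions next.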
This approach requires perfect, or at least a perfect over-estimation of the writes, and I'm not sure that's realistic in the blockchain setting, because we want to support arbitrary smart contracts, and many times we don't even know in advance where a transaction is going to write. If there is an unestimated write, the best Bohm can do is update its estimations and restart from scratch. Another line of prior work, a huge one, is software transactional memory (STM), which was introduced for the first time, I think, 30 years ago and has been studied a lot since. The idea is to come up with a framework to atomically execute arbitrary transactions. From the user-experience point of view this is very good: all the user needs to do, with an STM library, is specify a transaction begin, write arbitrary code, and specify a transaction end, and the library makes sure that all the code between the begin and the end is executed atomically. The user doesn't need to do anything else. With optimistic concurrency control, what happens is that we have a pool of transactions statically mapped to threads. A thread takes a transaction and executes it; during the execution it keeps track of a read set and a write set. After the execution it validates the reads to make sure that nothing changed, so all the decisions it made during the transaction are still valid. If validation succeeds, it can commit; otherwise it needs to retry the execution. This approach seems very good, and it seems like it could be a good fit for blockchains. However, there are a few problems.
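As a rough illustration of that optimistic loop — a toy single-threaded model with names I made up, not Block-STM itself — a transaction buffers its writes, logs the values it reads, and validates those reads against the shared store before committing:

```rust
use std::collections::HashMap;

type Key = &'static str;
type Val = i64;

// Per-transaction bookkeeping for optimistic concurrency control.
struct Txn {
    read_set: Vec<(Key, Val)>,    // every key read, with the value observed
    write_set: HashMap<Key, Val>, // writes are buffered, not applied
}

impl Txn {
    fn new() -> Self {
        Txn { read_set: Vec::new(), write_set: HashMap::new() }
    }

    fn read(&mut self, store: &HashMap<Key, Val>, k: Key) -> Val {
        // Read-your-own-writes; otherwise read the shared store and log it.
        if let Some(&v) = self.write_set.get(k) { return v; }
        let v = *store.get(k).unwrap_or(&0);
        self.read_set.push((k, v));
        v
    }

    fn write(&mut self, k: Key, v: Val) {
        self.write_set.insert(k, v);
    }

    // Validation: every value we read must still be current.
    fn validate(&self, store: &HashMap<Key, Val>) -> bool {
        self.read_set.iter().all(|(k, v)| *store.get(k).unwrap_or(&0) == *v)
    }

    fn commit(self, store: &mut HashMap<Key, Val>) {
        for (k, v) in self.write_set { store.insert(k, v); }
    }
}

fn main() {
    let mut store: HashMap<Key, Val> = HashMap::new();
    store.insert("x", 7);

    // Optimistically execute "x += 1".
    let mut t = Txn::new();
    let x = t.read(&store, "x");
    t.write("x", x + 1);

    // A concurrent writer changes "x" before we validate: we must retry.
    store.insert("x", 100);
    assert!(!t.validate(&store));

    // Retry against the new state; now validation succeeds and we commit.
    let mut t = Txn::new();
    let x = t.read(&store, "x");
    t.write("x", x + 1);
    assert!(t.validate(&store));
    t.commit(&mut store);
    assert_eq!(store["x"], 101);
}
```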
First, in practice, even though there are thousands of academic papers on software transactional memory, STMs are very rarely deployed. This is because their performance is usually very limited compared to fine-grained solutions: the bookkeeping, and the synchronization around it, is usually expensive. In addition, the outcome of a general STM is not deterministic. If two validators execute the same block, it is possible they will come up with different final states, because they execute the transactions in different orders. Deterministic, preset-order STMs do exist; however, they treat the order as a curse, as an additional constraint on the system, and as a result their performance is even worse. The good news about STMs is that we know they can be very efficient when they are tailored to a specific use case. In the next few slides I'm going to try to convince you that blockchain is actually a specific use case that can be good for STMs. The first observation is what I call the blockchain reality: in a general-purpose STM, we need to commit transactions individually, one by one. In the blockchain setting it doesn't matter; we can just commit the entire block together. This saves a lot of synchronization overhead in tracking individual commits. Another win from block granularity is that garbage collection comes for free. STMs usually pay a lot of synchronization to know when they can reclaim memory and so on.
Here, we can just do it trivially in between block executions. Another property of the blockchain use case is the safety of the VM. Usually, STMs execute transactions optimistically, which means a transaction can read values that are simply wrong. To make sure the program does not reach some inconsistent state, STMs usually pay a lot of synchronization overhead to satisfy a property called opacity, meaning that every read the transaction performs, every state it observes, is legal. Here is one example. Suppose we have a program with the invariant that X is always greater than Y. As you can see in this example, if the transactions are executed sequentially, the invariant always holds. But consider the following concurrent execution. The green thread writes X (say, X = 2); then the blue thread reads X. Later the purple thread writes X = 3, but that doesn't matter, because the blue thread has already read the old X. Then the purple thread writes Y (say, Y = 2, so the invariant still holds after its commit). Next, the blue thread reads Y and accidentally divides by zero, which might crash the entire program. However, in the blockchain use case we already have a VM that has to protect against arbitrary smart-contract bugs, so we don't need to care about this issue; it is already provided for us by the VM, and we can save a lot of synchronization overhead by not worrying about it.
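To make that interleaving concrete, here is a single-threaded replay of the schedule in Rust. The concrete values (X = 2, Y = 2) and the use of `checked_div` as a stand-in for the VM's protection are my illustration, not the talk's:

```rust
fn main() {
    // Shared state with invariant x > y; initially x = 2, y = 1.
    let mut x = 2i64;
    let mut y = 1i64;

    // The blue transaction snapshots x before the concurrent writer runs...
    let blue_x = x;

    // ...then the purple transaction commits x = 3, y = 2 (invariant kept: 3 > 2)...
    x = 3;
    y = 2;

    // ...and blue reads y from the *new* state: an inconsistent snapshot (2, 2),
    // even though every committed state satisfied x > y.
    let blue_y = y;
    assert_eq!((blue_x, blue_y), (2, 2));
    assert!(x > y); // all committed states are fine; only blue's view is broken

    // Blue now computes 1 / (x - y). Under opacity this snapshot would never
    // be visible; without it, a naive division crashes the process. A VM-style
    // guard (modeled here with checked_div) turns the crash into a recoverable
    // error, which is why Block-STM can skip paying for opacity.
    assert_eq!(1i64.checked_div(blue_x - blue_y), None);
}
```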
The first two observations are, I think, fairly intuitive in how they can improve performance and save synchronization overhead. The third observation, at least to me, was completely counterintuitive. The observation is that a preset order — defining the order of the transactions — can be a performance blessing. By defining the order of transactions, I mean that the final state of the execution must be equivalent to executing the transactions one by one in some predefined order. To be fair, this observation, as I mentioned earlier, was already made by Bohm in the context of databases. But we believe that in this work we take it to the extreme and use it in many, many places. Here is some intuition for why this might be helpful. Consider two transactions, X and Y. If we don't have a preset order, then we don't know a priori which transaction should go first. If the purple thread goes first, X is serialized before Y; if the green thread goes first, Y is serialized before X; and if they both go concurrently, it depends. Inherently, the threads need to solve a consensus-like synchronization task in order to determine who goes first, and at least intuitively that is a lot of synchronization overhead. When we predefine the order, we don't need to solve this task; we just need to make sure that everybody knows the order and follows it. Now I want to start talking about Block-STM, and first I want to describe the system components. We have the VM, and we treat it as a black box: we give it a single transaction and it executes it. Similarly to Bohm, we use multi-version data structures in order to avoid write-write conflicts.
But differently from Bohm, we do not assume that we know the estimations, so we do not build this multi-version data structure statically at the beginning. Instead, we learn the estimations on the fly and build the multi-version data structure dynamically. Another difference from general-purpose STMs is that we don't statically map transactions to the executing threads. Instead, we have a collaborative scheduler from which threads get their next transaction, either to execute or to validate, and we have the executor, which is basically the main loop that controls all the logic of a thread. Here is roughly how it works. A thread goes to the collaborative scheduler to get the next task to perform. If it is an execution task, the thread asks the VM to execute the transaction. The VM in turn goes to the multi-version data structure in order to read. After the VM finishes the execution, it does not write back to the multi-version data structure; instead, it returns a read set and a write set to the executor, so that the executor can track them for future validation. The thread then applies the writes to the multi-version data structure. If the thread gets a validation task from the collaborative scheduler, then it needs to revalidate all the reads in that transaction's read set. In both cases, execution and validation, the thread goes back to the collaborative scheduler and updates it according to the result, so the collaborative scheduler knows which transactions need execution or revalidation next.
Now a few words about the multi-version data structure we're using. The data structure itself is not novel; we're using a standard multi-version data structure for every memory location in order to avoid write-write conflicts. Whenever a transaction wants to write a value, it just adds a new slot to the data structure — in this example, transaction number six. When a transaction wants to read a value, say transaction number five here, it goes to the multi-version data structure and finds the value written by the highest-indexed transaction whose index is lower than its own. In this case, transaction five reads the value written by transaction three. Now let's talk about the collaborative scheduler, which is where most of the logic of Block-STM lives. We need logic to find the next task, and abstractly you can think of it as a global queue of tasks from which threads pick. But remember that we also need to respect the preset order: we need to execute the transactions according to it. So the collaborative scheduler needs to make sure that we prioritize execution and validation tasks for lower-indexed transactions. In addition, validations must logically occur in sequence, and I'll explain that in a second. Here is an example with our preset order. First we execute, say, transactions one, two, and three, optimistically and concurrently. Then we need to validate them, also optimistically and concurrently, so we validate transactions one, two, and three. Now assume that transactions one and three were good and their validations succeed, but transaction two aborts.
So transaction two has to be re-executed. Importantly, after transaction two is re-executed, because we need to follow the preset order, transaction three needs to be validated again, and possibly also re-executed. The collaborative scheduler optimistically dispatches both execution and validation tasks. A successful validation does not mean it is safe to commit; however, a failed validation does mean the transaction needs to be re-executed. So the collaborative scheduler needs to keep dispatching these validation and execution tasks and somehow eventually decide that there are no more tasks and the entire block can be committed. Another purpose of the collaborative scheduler is to manage dependencies. As I said before, similarly to Bohm, we leverage the preset order in order to reduce the abort rate. The way we do it is this: if transaction X tries reading a location that transaction Y is estimated to write to, and X depends on Y according to the preset order, then we suspend X and add X to Y's dependency list, meaning that whenever Y is finally executed, we can resume X and let it continue its execution. Earlier I told you that we do not assume anything about the estimations, so how exactly do we compute them on the fly? Remember our multi-version data structure: we are going to use the aborts to estimate the writes, and use the multi-version data structure to track them. Here is an example. Say transaction three executes, writes its values to the data structure, and then at some point fails validation.
The transaction then goes to all the places it touched, and instead of just deleting the values it wrote before, it marks them as estimates. So if transaction number five comes along later, the value it needs to read is the one written by transaction three, because it always looks for the highest-indexed transaction that wrote to this location with an index lower than five. Five comes and sees that the value is not there; however, there is an estimate mark, which means that transaction three is currently re-executing and is likely to write there again. So it's better for transaction five to just wait, because otherwise it will do wasted work: it will probably fail validation and need to be re-executed, and this is also how we can get cascades of aborts. If you want to compare our approach to write estimation via pre-execution — where we would learn the estimations by executing the transactions once against the initial state just to see where they are going to write — then in the good case we never pre-execute, we just execute. If we are lucky and can commit a transaction after a single execution, we are not wasting any work. In the bad case, when we do need to re-execute because the first execution aborted, at least the estimations we have are much fresher: we obtained them from a fresher state. And compared to Bohm, where the write estimations are given, in Block-STM this is just an optimization; the protocol works fine even if we do not estimate writes at all.
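Putting the last two slides together, here is a sketch of one location of the dynamic multi-version structure with estimate marks. This is my own simplification, not the actual Aptos code; the enum and names are illustrative:

```rust
use std::collections::BTreeMap;

// Block-STM-style dynamic multi-version memory for one location. Entries are
// added on the fly; aborted writes become ESTIMATE marks instead of deletions.
#[derive(Clone, PartialEq, Debug)]
enum Entry {
    Value(i64),
    Estimate, // the writer aborted and is likely to write here again
}

#[derive(PartialEq, Debug)]
enum ReadResult {
    Value(i64),        // value written by the highest lower-indexed txn
    Dependency(usize), // hit an ESTIMATE: suspend and wait for that txn
    FromStorage,       // no lower-indexed writer: read pre-block storage
}

#[derive(Default)]
struct Location {
    versions: BTreeMap<usize, Entry>, // writer txn index -> entry
}

impl Location {
    fn write(&mut self, txn: usize, v: i64) {
        self.versions.insert(txn, Entry::Value(v));
    }

    // On failed validation, the txn marks its old writes as estimates.
    fn mark_estimate(&mut self, txn: usize) {
        if let Some(e) = self.versions.get_mut(&txn) {
            *e = Entry::Estimate;
        }
    }

    fn read(&self, reader: usize) -> ReadResult {
        match self.versions.range(..reader).next_back() {
            Some((_, Entry::Value(v))) => ReadResult::Value(*v),
            Some((&w, Entry::Estimate)) => ReadResult::Dependency(w),
            None => ReadResult::FromStorage,
        }
    }
}

fn main() {
    let mut loc = Location::default();
    loc.write(3, 7); // txn 3 writes
    assert_eq!(loc.read(5), ReadResult::Value(7)); // txn 5 reads txn 3's value

    loc.mark_estimate(3); // txn 3 fails validation
    assert_eq!(loc.read(5), ReadResult::Dependency(3)); // txn 5 waits for txn 3

    loc.write(3, 8); // txn 3 re-executes and writes again
    assert_eq!(loc.read(5), ReadResult::Value(8));
    assert_eq!(loc.read(2), ReadResult::FromStorage);
}
```

Compare this with the Bohm sketch earlier: nothing is reserved up front, so an unestimated write is simply a new `BTreeMap` entry rather than a restart.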
The collaborative scheduler is really where the main logic of Block-STM is, and the key to our performance is to implement this logic as efficiently as possible. We need a way to pick the task with the lowest transaction index. If we just use an ordered set or a priority queue from the standard libraries, then even if the ordered set has relaxed semantics, in general it cannot scale, because such queues have too much contention. So we once again leverage the preset order, and by using it we are able to implement a very efficient counting-based concurrent ordered set. I'm going to give you the high-level intuition now. We have an array in which each slot corresponds to one transaction and represents whether that transaction needs to be revalidated or re-executed. We also keep one index, an atomic variable, which is a lower bound on the slots in which a transaction needs to be revalidated or re-executed. To pick a task, a thread performs a fetch-and-add on the index, takes the value it gets back, reads the corresponding slot, and sees what is there. If it's empty, there is nothing to do for this transaction, so the thread fetch-and-adds the index again. In this example, it sees that there is a transaction to validate, so it takes the task and validates it. Later, if we want to add a validation or execution task to this ordered set, all we need to do is mark the appropriate slot and pull the index back. Keeping track of which transactions need to be validated or executed — writing this array — is a bit more complicated. Each transaction goes through a life cycle of states. First it is ready-to-execute, which means, as it says, that it needs to be (re-)executed.
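The pick/add mechanics of that counting-based ordered set can be sketched like this. It is a toy: one set with all transactions initially pending, whereas the real scheduler keeps separate execution and validation indices and consults transaction statuses:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Counting-based concurrent "ordered set" of pending tasks:
// one flag per transaction plus an atomic lower-bound index.
struct TaskQueue {
    pending: Vec<AtomicBool>, // pending[i]: txn i needs (re)execution/validation
    idx: AtomicUsize,         // lower bound on the smallest pending index
}

impl TaskQueue {
    fn new(n: usize) -> Self {
        TaskQueue {
            pending: (0..n).map(|_| AtomicBool::new(true)).collect(),
            idx: AtomicUsize::new(0),
        }
    }

    // Pick the next task: fetch-and-add the index, then check the slot.
    // An empty slot just means "nothing to do for this txn, try the next one".
    fn next_task(&self) -> Option<usize> {
        loop {
            let i = self.idx.fetch_add(1, Ordering::SeqCst);
            if i >= self.pending.len() {
                return None; // past the end: no task right now
            }
            if self.pending[i].swap(false, Ordering::SeqCst) {
                return Some(i); // claimed the task for txn i
            }
        }
    }

    // Re-schedule txn i: mark its slot and pull the index back down.
    fn add_task(&self, i: usize) {
        self.pending[i].store(true, Ordering::SeqCst);
        self.idx.fetch_min(i, Ordering::SeqCst);
    }
}

fn main() {
    let q = TaskQueue::new(3);
    assert_eq!(q.next_task(), Some(0));
    assert_eq!(q.next_task(), Some(1));
    assert_eq!(q.next_task(), Some(2));
    assert_eq!(q.next_task(), None); // queue drained
    q.add_task(1);                   // txn 1 aborted: re-schedule it
    assert_eq!(q.next_task(), Some(1));
    assert_eq!(q.next_task(), None);
}
```

Because threads fetch-and-add a single counter, lower-indexed transactions are always claimed first, which is exactly the preset-order prioritization described above.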
Then executed means that it is ready to be validated. The reason we also need executing and aborting states in between is that we need to make sure that at most one thread is concurrently executing a transaction, and at most one thread is concurrently aborting it. This is why we have to sort out the races: we want at most one execution and at most one abort happening concurrently. In our implementation we actually compared lock-free and mutex-based implementations of this, and we saw that we can use mutexes without harming performance, which is what we do eventually, to keep the code more readable. This works because the fetch-and-increment mechanism already provides very good load balancing across the slots. A few other things I want to mention here: the collaborative scheduler also needs to handle the races between adding dependencies and resolving them. Also, when we hit a dependency, there are a few strategies we can take: one is just to abort and re-execute later; the other is to wait for a signal to continue. We experimented with both, and there are some performance trade-offs. Of course, what I've explained so far is a very high-level intuition, and there are a lot more details in the paper, but I want to touch very briefly on some of the algorithmic optimizations to give you an idea. First, we have two ordered sets: one for execution and one for validation. One thing we want is to validate as soon as possible. This helps us avoid cascades of aborts, because the sooner we realize that a transaction needs to be re-executed, the sooner we know that all the values it wrote are bad, and we don't want other transactions to read those values and continue executing; we want them to see dependencies and stop while we re-execute.
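That life cycle can be sketched as a small state machine. This is a single-threaded illustration; as just described, the actual implementation guards these transitions with a mutex (or a lock-free CAS) so that only one claimant wins each race:

```rust
// Per-transaction status life cycle. The intermediate Executing/Aborting
// states exist so that at most one thread executes, and at most one thread
// aborts, a given transaction at a time.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    ReadyToExecute, // needs (re-)execution
    Executing,      // claimed by exactly one executing thread
    Executed,       // ready to be validated
    Aborting,       // failed validation, claimed by exactly one aborting thread
}

struct TxnState {
    status: Status,
}

impl TxnState {
    // Try to claim the execution of this transaction; only one claimant wins.
    fn try_start_execution(&mut self) -> bool {
        if self.status == Status::ReadyToExecute {
            self.status = Status::Executing;
            true
        } else {
            false
        }
    }

    fn finish_execution(&mut self) {
        assert_eq!(self.status, Status::Executing);
        self.status = Status::Executed; // now ready to be validated
    }

    // Try to claim the abort after a failed validation; only one claimant wins.
    fn try_abort(&mut self) -> bool {
        if self.status == Status::Executed {
            self.status = Status::Aborting;
            true
        } else {
            false
        }
    }

    fn finish_abort(&mut self) {
        assert_eq!(self.status, Status::Aborting);
        self.status = Status::ReadyToExecute; // will be re-executed
    }
}

fn main() {
    let mut t = TxnState { status: Status::ReadyToExecute };
    assert!(t.try_start_execution());  // first thread claims the execution
    assert!(!t.try_start_execution()); // a second claimant loses the race
    t.finish_execution();
    assert!(t.try_abort());            // validation failed: claim the abort
    assert!(!t.try_abort());
    t.finish_abort();
    assert_eq!(t.status, Status::ReadyToExecute);
}
```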
To do this, whenever we see an abort, we immediately schedule the higher-indexed transactions for validation. When the re-execution finishes, sometimes we need to schedule these validations again. However, we have an optimization here: if the re-execution doesn't write to any new location — meaning that everything it wrote now it also wrote before, so everything was already marked as an estimate in the data structures — then we don't need to revalidate again. We also try to decrease the index as little as possible, so sometimes we just pass the task around: an execution finishes and, instead of pulling back the validation index, if its transaction needs to be validated, the thread just picks that task and starts validating it. At the beginning I told you that one of the things that improves our performance is that we can lazily commit the entire block together; we don't track individual commits, and this is how we do it. The collaborative scheduler also keeps track of all ongoing execution and validation tasks. Whenever a thread observes that there are no ongoing tasks and, simultaneously, no more tasks in the queues — meaning that both indexes are equal to the number of transactions — then all the transactions can be safely committed. To verify that these two properties hold simultaneously, we implemented a double-collect technique. As for correctness, in the paper we have a long, formal proof of both safety and liveness. We basically show that the final state is equivalent to a sequential run of all transactions: if a thread simultaneously sees both conditions, then it is safe to commit. And for liveness, we prove that there is no deadlock or livelock: if threads keep taking steps, they will all terminate, and eventually one thread will see these conditions and notify the others. Okay, now let's talk about the fun stuff.
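Before moving on, the commit check can be sketched as a double collect over the scheduler's counters. The field names, and the `decrease_cnt` counter that detects an index being pulled back between the two collects, are my own illustration of the idea, not the paper's exact variables:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Commit detection via double collect: both queues must be drained AND no
// task may be in flight, observed at the same logical moment.
struct Scheduler {
    num_txns: usize,
    execution_idx: AtomicUsize,    // next execution slot to look at
    validation_idx: AtomicUsize,   // next validation slot to look at
    num_active_tasks: AtomicUsize, // ongoing execution/validation tasks
    decrease_cnt: AtomicUsize,     // bumped every time an index is pulled back
}

impl Scheduler {
    fn snapshot(&self) -> (usize, usize, usize, usize) {
        (
            self.execution_idx.load(Ordering::SeqCst),
            self.validation_idx.load(Ordering::SeqCst),
            self.num_active_tasks.load(Ordering::SeqCst),
            self.decrease_cnt.load(Ordering::SeqCst),
        )
    }

    fn block_done(&self) -> bool {
        let first = self.snapshot();
        let quiescent =
            first.0 >= self.num_txns && first.1 >= self.num_txns && first.2 == 0;
        // Second collect: commit only if nothing moved in between (including
        // an index being pulled back and re-advanced), so both conditions
        // really held at the same moment.
        quiescent && self.snapshot() == first
    }
}

fn main() {
    let s = Scheduler {
        num_txns: 10,
        execution_idx: AtomicUsize::new(10),
        validation_idx: AtomicUsize::new(7), // validation queue not drained yet
        num_active_tasks: AtomicUsize::new(0),
        decrease_cnt: AtomicUsize::new(0),
    };
    assert!(!s.block_done());
    s.validation_idx.store(10, Ordering::SeqCst);
    assert!(s.block_done()); // both queues drained, nothing in flight: commit
}
```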
We implemented Block-STM in Rust on both the Diem and Aptos codebases and merged it into their main branches. For comparison, we also implemented Bohm and LiTM on the Diem codebase. Because Bohm needs perfect estimations, we give Bohm perfect estimations and do not measure the time to obtain them. LiTM is a recent deterministic STM algorithm. For the experiments we use peer-to-peer transactions, the standard transactions provided by Diem and Aptos. We use two block sizes, 1k and 10k, and we experiment with 2, 10, 100, 1,000, and 10,000 accounts. Note that the number of accounts corresponds to the contention level: the more accounts we have, the fewer conflicts there are and the more parallelizable the workload is. In these plots we see the results on the Diem blockchain with standard peer-to-peer transactions, which are non-trivial: they consist of 21 reads and four writes. The comparison to Bohm is important here. When the block size is 1k, we're actually better than Bohm, even though we give Bohm its estimations for free, with no penalty; we just measure how long it takes Bohm to execute with these free estimations. Block-STM slightly outperforms Bohm at a block size of 1k because Bohm needs to precompute its static data structures, and that takes time before Bohm starts executing. When we increase the block size to 10k, Bohm actually slightly outperforms Block-STM, because the time to precompute the static data structures is now amortized over a bigger block. On the next plot I want to show you the comparison between Block-STM and sequential execution at different contention levels. This was done on the Aptos blockchain, again with non-trivial peer-to-peer transactions, with eight reads and five writes.
We see from the blue line that when the contention is low, with 10,000 accounts, we can achieve more than 160,000 transactions per second. When the contention is high, with 100 accounts, we still get very good numbers: more than 80,000 transactions per second. Now, no less importantly, consider two accounts, which basically means that all the transactions conflict with one another and the best we can do is execute them sequentially. Comparing the red line to the black line, where the black line is just the sequential execution of the transactions, you can see that Block-STM's overhead is very small, which is exactly the goal we wanted to achieve: a parallel engine that dynamically extracts the parallelism of every workload and adapts to it. To conclude, there are a few possible extensions to this work. First, we can consider nested transactions in order to deal with popular contracts. We can also try to combine Block-STM with the miner-replay approach I presented at the beginning: let the miner run the block using Block-STM, extract the dependency graph, and then send it to the other validators. We also didn't optimize our implementation for NUMA or hyper-threading, so that is something we can do if we want to push the performance further. In general, thinking about blockchain performance: we already know how to speed up consensus — a few recent papers and systems show how to order over 100,000 transactions per second, and we have at least a few of them — and with this work we can speed up execution. However, if we want end-to-end performance of over 100,000 transactions per second, then we still need to improve the storage. We have the consensus, now we have the execution; storage is the last thing we need to improve.
We will probably be able to do it, and in a very short time we will have a blockchain that runs end-to-end at over 100,000 transactions per second. However, if we think further into the future, a few years from now, then we're talking about millions of transactions per second, and I think we will need to find a very good way to shard the blockchain. This is not an easy problem, and I hope that as a research community we will be able to come up with some cool ideas and good solutions. Okay, I think I'm almost done with my talk. I just want to mention that at Aptos we are now building a lab and we are hiring, but more importantly, we are looking for collaborations. So we are really open to collaboration on this project and other projects. Thank you for having me.