Okay, hello, this is the talk on formal verification and performance simulation in real-world applications by Nicolas Barry, the CTO of the Stellar Development Foundation, and me, Hidenori Shinohara, a software engineer at the Stellar Development Foundation. Nicolas won't be here for the presentation, but he'll be joining us for the Q&A session after the talk, and I'll be doing the presentation.

I wanted to start this talk by talking about what Stellar is, because that's what everything in this talk is based on. So what is Stellar? It's an open network for storing and moving money. If you're really into cryptocurrencies, you might have heard of Stellar Lumens. The network has an open membership policy, so anyone can join. The code is open source, and it uses a decentralized protocol. Decentralized means there is no single organization checking or verifying all the transactions, unlike traditional financial institutions. There are a couple of main use cases. One of them is low-cost transfers between digital currency and fiat money, fiat money being currencies like US dollars or euros. Another use case is cross-border transactions between currencies.

Stellar Core is, in a way, the core of Stellar. It's a replicated state machine that maintains a local copy of a cryptographic ledger and processes transactions against it, in consensus with a set of peers. This is the code that every node in the network runs, among other things. "In consensus with a set of peers" means that each node in the network has a set of peers that it trusts: operators configure which nodes they trust, and once some threshold of those trusted peers agrees on a transaction, that node in turn agrees on it. What I just described is actually a very oversimplified version of the Stellar Consensus Protocol, which is a federated consensus protocol.
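To make the trust-threshold rule just described a little more concrete, here is a minimal C++ sketch. The names and structure are purely illustrative, not taken from Stellar Core: a node holds a configured list of trusted peers and agrees with a statement once at least a threshold number of those peers have agreed.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative per-node trust configuration (hypothetical names, not
// Stellar Core code): the node agrees once at least `threshold` of the
// peers it chose to trust have agreed.
struct TrustConfig {
    std::vector<std::string> trustedPeers;  // peers this node decided to trust
    std::size_t threshold;                  // how many of them must agree
};

bool nodeAgrees(const TrustConfig& cfg,
                const std::unordered_set<std::string>& peersThatAgreed) {
    std::size_t count = 0;
    for (const auto& peer : cfg.trustedPeers)
        if (peersThatAgreed.count(peer) > 0) ++count;
    return count >= cfg.threshold;
}
```

The real protocol is far subtler than this, of course; the overlap between different nodes' trust lists is what makes network-wide agreement possible, as described next.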
The idea is that because those trust lists overlap with each other, in the end every node in the network agrees on the same transactions; that, again, is the Stellar Consensus Protocol. Stellar Core is written in C++14.

Since the Stellar network deals with money, there are a lot of things that we want from the network, and two of the most important requirements are scalability and security. Starting with scalability, an example question is: what if the number of nodes in the network doubles? This is something that could actually happen. Some things that could go wrong: maybe some nodes get too many messages once there are too many nodes in the network, maybe they don't get enough messages, or maybe they just go out of sync. Those are things we definitely need to think about related to scalability.

In terms of security, a question we might ask is: how do we update the protocol to adapt to a higher transaction volume? Maybe we want to make the protocol faster. But when we make a change, we want to make sure it will be immune to Byzantine attacks, Byzantine as in bad actors; since this is a financial network, some people may try to do something funny. We want to make sure the network continues to process transactions, which is related to guaranteed convergence, and we also want to make sure that the new protocol we introduce converges faster than the current one.

So with these two major requirements, we're going to tackle scalability and security from two different perspectives: a practical aspect, which is simulation, and a theoretical aspect, which is formal verification.
Those two might seem like very different topics, simulation and formal verification, but there is one important lesson that we learned from both projects, and that is the importance of improving the iteration speed to maximize learning. Every iteration teaches us a lot, and a faster iteration speed means more learning opportunities, so improving the iteration speed is really, really important. The question that we kept asking ourselves throughout both projects was: what is the simplest solution that represents a significant step towards the full solution? Sometimes we're not sure the full solution is the right answer; sometimes we just don't have the time to actually build the full solution. At the same time, we don't want to do anything that's not genuinely useful. So we try to find the simplest thing that is still a significant step towards the full solution, and by asking this question we were able to improve the iteration speed and maximize learning. That's going to be the theme of this talk: we'll go over those two projects, simulation and formal verification, and how we applied this principle to each of them.

Okay, so the first approach to scalability and security is the practical one, network simulation. Before we talk about the approach, let's talk about the problems we're trying to solve. There are many challenges with the Stellar Core network. It's a peer-to-peer network, a little similar to file-sharing software, which means the connectivity topology is fairly complex. Nodes can configure and decide who they prefer to connect to, and that leads to a fairly complex topology.
The latency can vary, and the bandwidth between peers is hard to predict. There's also very little control over which versions run where; operators don't call us to tell us what versions they're running or what machines they're running the software on. There's very little monitoring visibility, and we don't have the capability to A/B test or roll back. Those are some of the challenges we face with the Stellar Core network, and we decided that one approach we can take is to simulate as much as we can. That's what we're going to talk about here.

The setup we have is this: we run everything on AWS using Kubernetes. We put a copy of Stellar Core in a Docker container, so each core-in-a-container represents a node in the network. The containers talk to each other, with the topology taken from the actual network, and we export metrics and monitor them using a combination of Prometheus and Grafana.

But this is hard. Running one core node isn't the most trivial task, and we're talking about running hundreds of them at the same time and simulating the real network by sending transactions and so on. This can be overwhelmingly difficult. So we're going to ask our favorite question of the day: what is the simplest significant step we can take towards the full solution?

The first thing we did was to split the problem into two parts. The first is the transaction subsystem, and the second is simulation that focuses on the distributed nature of the network. The first part requires a fairly complicated database setup, and we can actually improve that subsystem with fairly standard performance and stress testing.
We do that too, but for this particular project we decided it wasn't going to be in scope. The second part, the simulation that focuses on the distributed nature, like how many bytes nodes send to each other, is the topic of this project and also of this talk.

So what does it mean to focus on the distributed nature? The first example is that instead of applying transactions, we just sleep. This is possible because we skipped the database work, so we don't have to worry about applying transactions and committing them. The sleep duration is modeled on analysis of the real network as a multimodal distribution: we measured how long we spend applying transactions, broken down by transaction type, and for each type we checked how long it takes to apply one. We took that data and decided how long to sleep, so it's pretty close to what we actually do in the real network, without actually applying transactions.

The second simplification was that instead of real-world transactions, nodes all send the same type of transaction. In the actual network, for obvious reasons, there are many, many types of transactions, and they can be fairly complicated; some are harder to generate. So we picked one type of transaction that's fairly easy to generate and set up each node in the network to send it to the others. For that transaction we varied the message size, and we also have a transaction rate, and both of those numbers are based on analysis of the actual network. This is actually pretty cool, because the Stellar network is a blockchain, so all the transaction history is publicly available. You can just download the past transaction history and decide what the ideal transaction rate would be.
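As an aside on the first simplification, sleeping instead of applying: drawing the sleep time from a fitted multimodal distribution might look roughly like the sketch below. Every number here is invented for illustration; the real parameters would come from the network analysis described above.

```cpp
#include <chrono>
#include <random>
#include <thread>

// Illustrative sketch: instead of applying a transaction, sleep for a time
// drawn from a multimodal distribution (here a two-component Gaussian
// mixture) fitted to measured apply times. All parameters are made up.
double sampleApplyTimeMs(std::mt19937& rng) {
    std::bernoulli_distribution pickFastMode(0.7);     // 70% fast path
    std::normal_distribution<double> fast(2.0, 0.5);   // mode around 2 ms
    std::normal_distribution<double> slow(15.0, 3.0);  // mode around 15 ms
    double ms = pickFastMode(rng) ? fast(rng) : slow(rng);
    return ms > 0.0 ? ms : 0.0;  // clamp the occasional negative sample
}

// Stand-in for "apply this transaction": just wait the sampled time.
void simulateApply(std::mt19937& rng) {
    double ms = sampleApplyTimeMs(rng);
    std::this_thread::sleep_for(std::chrono::duration<double, std::milli>(ms));
}
```

In a real setup there would be one fitted distribution per transaction type, as described above, rather than one global mixture.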
What would be the most realistic transaction rate, based on publicly available information, which I think is really cool. The last example is that we started with a very simple topology. This is another example of simplification: we started with something very, very simple, then as a next step we surveyed the actual network and simulated it by copying its topology. The third step is to simulate the future network by scaling the survey data, asking questions like: what if we had a couple hundred more nodes than we do now? So that was another example of starting simple and working your way up.

After going through many iterations, what we have right now is the ability to create a network based on the public network topology: we can survey the public network, recreate it in the cluster, and have the nodes talk to each other. During the simulation we can monitor the metrics of each node: we can check how much CPU and memory each node is using, and we can also check the consensus latency. Being able to check the CPU and memory usage of each node is really powerful, because this is a peer-to-peer network with open membership, so it's practically impossible to check other nodes' CPU and memory usage in the real network unless the operators take a screenshot or something and send it to us. Being able to do that in the simulation, and get some idea of each node's CPU and memory usage, is extremely powerful.

Here's an example of something we learned through these simulations. This happened a couple of weeks ago, actually. This is a screenshot of one of the dashboards we have for the simulation, showing the metric scp.timing.nominated, which for the purposes of this talk you can think of as the delay in reaching consensus on a transaction.
Basically, the higher the number, the worse it is; we don't want the number to be too big. If you look at this chart, you can see there is one blue line that's really, really high, almost 40 seconds, whereas most other nodes are close to zero. That's a little concerning. What happened to that node? Does it have too many connections? Is it overloaded? Is it in sync or out of sync? Those are the questions we ask when we see something like this.

Because this is a simulation, we were able to investigate that node. The first thing we checked was the CPU and memory usage: was it overwhelmed? It turns out it wasn't. What about the incoming traffic level: was the node overwhelmed by all the incoming messages? That was not the case either. The problem was that the node had too few connections; in this particular simulation it only had two. When the two nodes it was connected to each had some small delay, those delays somehow got magnified, and this node ended up being delayed a lot. This makes us aware of many things. For instance, do we even want to support this configuration? Should nodes be able to connect to only two nodes and still stay in sync? By default, Stellar Core recommends that nodes connect to, I think, eight nodes, and that seems a little more reasonable to me than two. Those are the questions we can ask based on this investigation.

As for next steps for the simulation project, the first one is simulating future growth: what if the network has more nodes or more transactions? The node count question is actually really interesting, because we can look at the current network topology as a graph and ask things like: what about the degree distribution?
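The degree-distribution question, and the hop-count question that comes up next, can both be computed directly from an adjacency list. Here is a generic graph-theory sketch (not Stellar tooling): the degree distribution counts how many nodes have each degree, and the diameter is the longest shortest-path hop count, found here with a plain BFS from every node.

```cpp
#include <map>
#include <queue>
#include <vector>

// Adjacency list: g[u] holds the neighbors of node u.
using Graph = std::vector<std::vector<int>>;

// Maps degree -> number of nodes with that degree.
std::map<int, int> degreeDistribution(const Graph& g) {
    std::map<int, int> dist;
    for (const auto& nbrs : g) ++dist[(int)nbrs.size()];
    return dist;
}

// Longest shortest-path hop count; assumes the graph is connected.
int diameter(const Graph& g) {
    int best = 0;
    for (int src = 0; src < (int)g.size(); ++src) {
        std::vector<int> dist(g.size(), -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        while (!q.empty()) {  // plain BFS from src
            int u = q.front();
            q.pop();
            for (int v : g[u])
                if (dist[v] < 0) { dist[v] = dist[u] + 1; q.push(v); }
        }
        for (int d : dist)
            if (d > best) best = d;
    }
    return best;
}
```

On a surveyed topology with a few hundred nodes, the all-pairs BFS above is cheap; for much larger graphs one would sample sources instead.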
That's a very graph-theory thing to do, but we can also ask how many hops it takes to get from one node to another; the longest such distance is called the diameter in graph theory. When we have more nodes in the network, it seems reasonable to me that the degree distribution won't change that much; just because the network gets bigger doesn't mean all the nodes will have better machines that support more connections. But it does seem that once we have more nodes, the number of hops it takes to get from one node to another will change. Those are the questions we ask when we think about what would happen when the network grows. Another next step for this project is network behavior in different situations: we want to look into anomalies at the node level and also traffic pattern changes. Those are the next steps of this simulation project.

Okay, so the second approach we have for tackling the security and scalability questions is the theoretical one, and that is formal verification. For this project, the idea is that we want to use Ivy to create a formally verified model of the Stellar Consensus Protocol, the protocol that the Stellar Core code implements. The project has a couple of goals. The first is to provide another way for developers to understand the protocol: we have the white paper, blog posts, videos, and presentations, but it always helps to have another way to understand the protocol. The second goal is to verify new changes to the protocol, and here again the two main questions are security and scalability. With security, when we change the protocol, we want to ask questions like: does the network still function as a proper financial network after the change?
Will it prevent double spending, for instance? Will it be immune to Byzantine attacks, where some nodes decide to do something funny? In terms of scalability: does the new protocol work with a bigger network? With unit testing, there is a limit to the number of nodes you can check, but if we formally verify that the protocol works with any number of nodes, that's a very powerful thing to know. That's the type of thing we want to do with this project.

I'm aware that Ivy and formal verification may not be the most well-known topics out there, so I figured I would do a demo to show what it's like to work with them. Here is an object called bank, implemented in Ivy. It's a very, very silly example, but hopefully it illustrates what it's like to work with Ivy and formal verification. This bank is for Alice and Bob, and it's a fairly simple implementation: it has a balance for Alice and a balance for Bob, both of type money. The type money is defined above; we don't really care about the details here. After init, that is, after initialization, the balances for Alice and Bob are both set to $10. It's a very generous bank; it just starts by giving everyone $10. After that we have two actions, one for Alice to spend money and one for Bob. These are also fairly simple: the action for Alice takes x dollars as input, checks that Alice has at least x dollars in her account, and subtracts x from her balance. Basically, Alice is withdrawing or spending x dollars. Then we have the same thing for Bob: it takes x dollars and subtracts x from Bob's balance. And you can see that there is an issue here.
We don't check that Bob actually has x dollars. That's not good; a bank shouldn't be giving out money like that if it wants to succeed. Next we have this thing called an invariant, which states that the balance for Alice has to be greater than or equal to zero, and the same for the balance for Bob. This is probably the most important thing in banking operations: banks don't hand out money unless the customer actually has it. So that's what we write down here. A couple of things to note: we say that the money type is an integer. We don't really care about the details of how that's declared; the point is that money is an integer.

So the whole thing makes sense. This is a bank object where the balances for Alice and Bob are money, which is an integer; we start with $10 in each account; and we have a way for Alice to spend money and for Bob to spend money, but the code for Bob has a bug in it. The problem is that the invariant, that the balances for Alice and Bob are always at least zero dollars, is not necessarily true, because Bob can spend more money than he has.

Let's see what Ivy says. The idea is that we provide this implementation to Ivy and tell Ivy that we want to make sure this invariant is always true. We don't give Ivy any unit tests or anything; we just say, here is the implementation of the bank and here is the property we want to check. Can Ivy check that without any tests? That's what we're going to do here: ask Ivy to prove that this invariant is always true for this bank implementation, or that it's not. When we run this, something interesting happens. For line 32, which is the line with the invariant, it says "fail", and that's what we expect. It says it is searching for a small model, and it starts with the balance of Alice being zero and the balance of Bob being zero.
Then it calls the action where Bob spends money with x equal to one, so Bob spends one dollar. Ivy says that the balance of Bob will become minus one, and that's exactly how we would break the invariant if we shipped this implementation. Ivy is able to find this without any help: we just provide the implementation and the property, and Ivy says, hey, this is not good, this invariant can break.

Now we're going to do something a little interesting: we're going to fix this bug by adding the missing check that the balance of Bob is at least x dollars, so that Bob can't spend more money than he has. We run it again, and now it says "OK" and everything passes. That means Ivy takes this implementation and the property we want to prove and says: based on this implementation, this property is always going to be true; the balances of Alice and Bob will always be at least zero dollars. And Ivy can do this without any help. It just takes the implementation and the property, and then either proves it, or disproves it and gives you a counterexample.

As you can see from the bank example, there are two steps to formal verification. The first step is to create a model; in the bank example we just implemented a silly Alice-and-Bob bank, but the idea is that you implement the protocol in Ivy. The second step is to list the properties to prove; in the bank example we said we want to prove the invariant that both balances stay at least zero. In the case of the Stellar Consensus Protocol, it's going to be a bit more complicated. Some example properties: all nodes process the same transactions, and all well-behaved nodes ignore any node submitting illegal transactions.
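As a brief aside, the kind of exhaustive check Ivy performs on the bank model can be mimicked by hand over a small, finite money domain. This standalone C++ sketch (plain C++, not Ivy, and not SDF code) explores every reachable state of a two-account bank and checks the invariant "both balances stay non-negative"; a `guarded` flag toggles whether Bob's spend action checks his balance first, which is exactly the bug from the demo.

```cpp
#include <set>
#include <utility>
#include <vector>

// Hand-rolled analogue of Ivy's exhaustive check on the bank model:
// explore every reachable (alice, bob) balance state over a tiny money
// domain and verify that both balances stay >= 0.
bool invariantHolds(bool guarded) {
    const int kInit = 10, kMaxSpend = 11;  // deliberately tiny search space
    typedef std::pair<int, int> State;     // (alice balance, bob balance)
    std::set<State> seen;
    std::vector<State> frontier;
    frontier.push_back(State(kInit, kInit));
    seen.insert(frontier.back());
    while (!frontier.empty()) {
        State s = frontier.back();
        frontier.pop_back();
        if (s.first < 0 || s.second < 0) return false;  // invariant violated
        for (int x = 1; x <= kMaxSpend; ++x) {
            // Alice's action always checks her balance first.
            if (s.first >= x && seen.insert(State(s.first - x, s.second)).second)
                frontier.push_back(State(s.first - x, s.second));
            // Bob's action only checks when `guarded` is true (the fix).
            bool bobOk = guarded ? (s.second >= x) : true;
            if (bobOk && seen.insert(State(s.first, s.second - x)).second)
                frontier.push_back(State(s.first, s.second - x));
        }
    }
    return true;  // every reachable state satisfied the invariant
}
```

With the guard missing, the search immediately reaches a negative balance and reports the violation; with the guard in place, the reachable states are bounded and the invariant holds, which mirrors the "fail" then "OK" runs described above. Ivy, of course, proves the property for unbounded integers rather than a fixed domain.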
For instance, if some nodes try to submit transactions that are basically double spends, we want to make sure that good nodes will ignore them. But this is hard. Consensus protocols are hard, and formal verification is also extremely challenging, and putting them together doesn't make it any easier: formally verifying a consensus protocol is overwhelmingly difficult. So this is when we want to ask our favorite question of the day: what is the simplest significant step we can take towards the full solution?

What we did was to split the problem. We split the protocol into three parts: federated voting, the nomination protocol, and the ballot protocol. Federated voting is a building block for the nomination and ballot protocols, a prerequisite to everything, and the other two are fairly independent of each other. So we decided to split the protocol into these three parts and work on each of them separately.

Another thing we did was to limit the search space. In the bank example, I'm not sure if you remember, but the money type was an integer. You can imagine that if the money type were instead a five-bit integer, it would be fairly easy to prove any property, because there would only be 32 possible values for each balance. Limiting the search space is extremely powerful, because Ivy can then do something like a brute-force search for you. Limiting the search space and expanding it back later can be a very powerful tool; it doesn't always work, but it can be. So what we did was restrict the number of nodes, the topology, and the number of transactions, all of them, actually, and we decided to expand them back later. So what does it mean to simplify the problem?
We started with this ultimate goal: model and prove the properties of a network that implements the Stellar Consensus Protocol, a relatively complicated protocol. The network could have any number of nodes, who knows; it could have bad actors, ill-behaved nodes; and it processes any number of valid or invalid transactions. That's a fairly overwhelming task. After simplification, we decided to start with a model that implements one part of the Stellar Consensus Protocol, with only two nodes, both of them good nodes that don't do anything funny, processing only two valid transactions, just two, after which the network simply stops.

To put this in perspective, in federated voting each node can vote, accept, or confirm. Those sound similar, and they are related, but in federated voting they are three very different actions. What this means is that there are only two nodes, node A and node B; only three actions, voting, accepting, and confirming; and only two transactions, transaction X and transaction Y. So at any point there are at most two times three times two, or 12, possible actions the network can take. It's actually even fewer than that, because in federated voting, for instance, a node cannot confirm before accepting; that's a detail I'm not going to go into. The point is that at any moment there are so few possible actions that Ivy can essentially brute-force all possibilities and prove things. So you can see how the simplification makes things easier for Ivy, and for us too.

After simplifying a lot and getting some simple steps done, we started to reintroduce complexity: we increased the number of nodes, the number of transactions, and the network topologies, and each iteration actually taught us a lot.
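To see concretely why the simplified model is so small, here is a sketch that enumerates the candidate actions: with 2 nodes, 3 action kinds, and 2 transactions there are at most 2 × 3 × 2 = 12 of them, and ordering rules like "no confirm before accept" cut that down further. The progress encoding here is invented for illustration and is not how the Ivy model represents state.

```cpp
#include <vector>

// Illustrative only: enumerate the legal next actions in the simplified
// two-node, two-transaction federated voting model. progress[n][t] tracks
// node n's progress on transaction t: 0 = nothing, 1 = voted, 2 = accepted.
enum class Kind { Vote, Accept, Confirm };

struct Action {
    int node;   // 0 or 1
    Kind kind;  // vote, accept, or confirm
    int tx;     // 0 or 1
};

std::vector<Action> legalActions(const int progress[2][2]) {
    std::vector<Action> out;
    for (int n = 0; n < 2; ++n)
        for (int t = 0; t < 2; ++t) {
            int p = progress[n][t];
            if (p == 0) out.push_back({n, Kind::Vote, t});
            if (p == 1) out.push_back({n, Kind::Accept, t});  // only after voting
            if (p == 2) out.push_back({n, Kind::Confirm, t}); // only after accepting
        }
    return out;
}
```

From any state, at most one of the three kinds is legal per (node, transaction) pair, so the branching factor never exceeds four here, which is why exhaustive exploration of this simplified model is trivial.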
It was actually quite interesting to learn how Ivy understands SCP, or each part of SCP, and we also learned a lot about Ivy and about formal verification. After those many iterations, we finally have a formally verified model for each of federated voting, the nomination protocol, and the ballot protocol, so pretty much all parts of the Stellar Consensus Protocol. All these models allow any number of nodes and any topology, some nodes can be Byzantine, and they can process any number of valid transactions.

Some of the next steps for this project: the first is automated test generation. We have this formally verified model in Ivy, but that's not what's running in the Stellar network; what's running is Stellar Core, which is written in C++14. The idea is that we want to test the C++ implementation using the formally verified model, which would give us a pretty interesting way of testing the implementation. Another next step is to create an alternative ballot protocol model; we have a couple of approaches, and we're thinking of creating a new one. And there is more work to do on liveness properties.

What this means is that there are two kinds of properties we care a lot about: safety and liveness. Safety means we want to make sure nodes agree on the transaction history; liveness means nodes can process new transactions. If I oversimplify, safety is about whether people lose money, and liveness is about whether people can make payments. So say we come up with a very silly payment protocol that doesn't actually process any payments. It's safe, in a way: nobody is going to lose money, and every node in the network will agree on the same transaction history, because in this protocol there are no payments, even though it's supposed to be a payment network.
Nodes will agree that the transaction history is always empty, so it's safe, but of course that's fairly useless. We want to make sure the network can actually process transactions; that's a liveness property.

So that brings us to the end of this presentation. The main takeaway is that we should keep asking what the simplest solution is that represents a significant step towards the full solution. You don't want to do something that isn't useful, but sometimes it's unclear whether the full solution is actually what you want. What you can do instead is find something very, very simple that is still a good step towards the full solution; that way you can validate it, learn from that iteration, and apply what you learned to the next one.

And that is it. If you have any questions, Nicolas and I will be available to answer them in the Q&A session after this presentation. Here are the links related to the topics discussed in this talk. We are hiring; the Stellar Development Foundation is hiring, so if you're interested, please check out the careers page on stellar.org. Thank you so much for your time today.