Today I'm here to talk about validating protocols at the Eth1 level, and hopefully at the Eth2 level, using AI agents. We focused on the AI technique known as reinforcement learning, and you may be wondering why we're looking at this, how it fits, why we're looking at machine learning at all. I'm going to explain all of this and talk you through why we decided to go that route.

In the past we've looked at something called selfish mining, which is a strategy and an attack on the Ethereum 1.x protocol where, essentially, you're trying to attack the network and manipulate the difficulty calculation. You do this by keeping a private chain that is ahead of the public one: you don't publish your blocks as soon as you mine them, but hold them and release them strategically as other miners find blocks. The idea is that you make the honest miners waste their time working on the blocks you just released, and you always keep ahead so that you ensure you're winning. This is a problem that's been studied in different papers, and there are different algorithms you can follow. As I described, you hold a number of blocks, and whenever you have a lead and someone sends you a block that increases the public chain height, you publish a block you've already mined for that height, always keeping ahead so that you maximize the reward you're getting, because when you publish your blocks you'll either get a main-chain reward or an uncle reward, which are almost the same.

I work with the PegaSys team at ConsenSys, and we focus on research, so we've developed a framework to build simulations for different types of protocols. We've looked at proof-of-work and proof-of-stake, and we decided it would be pretty interesting to implement a selfish-mining attack and look at the data it would return. This is a bit of what the code looks like. We actually took the difficulty calculation and looked at how it's done, just to get a sense of what the rewards are. Depending on what you're doing, as I mentioned before, you can get a reward from mining a block, which is right now two ETH plus the fees of the transactions included, and a smaller reward if any of your blocks end up included as uncles. So your reward may vary: 1.75, 1.25, or two ETH. And since the transaction fees you receive are sometimes quite small, we decided not to include them in our simulation.

So we built a test case with a network of nine miners who are honest and one miner that is selfishly mining. There's some variation, and we didn't include transaction rewards, and some of the results are this: essentially, what we found is that once you have 25% of the hashing power of the network, you can start attacking with selfish mining and it's more beneficial for you, because it manipulates the difficulty calculation and increases your reward. The more hashing power you have, the better for you. So we wanted to test how this would play out if we handed it to a reinforcement learning agent, and since we had actual data to back it up, we could compare the results and see what we would get. And so that's why you might be wondering why we're doing reinforcement learning here in the first place.
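To make the strategy concrete, here is a minimal sketch of the withhold-and-publish rule described above, in Python. The class and method names (`SelfishMiner`, `on_public_block`) are illustrative, not the actual simulator's API; the reward numbers follow the Eth1 values mentioned in the talk (a two-ETH main-chain reward, and uncle rewards that shrink with depth).

```python
BLOCK_REWARD = 2.0  # ETH, the current Eth1 main-chain reward

def uncle_reward(depth: int) -> float:
    """Reward for a block included as an uncle `depth` generations back
    (1.75 ETH at depth 1, 1.25 at depth 3, etc., per the Eth1 rules)."""
    return (8 - depth) / 8 * BLOCK_REWARD

class SelfishMiner:
    """Simplified selfish-mining rule: withhold mined blocks, and release
    one strategically whenever the public chain grows, staying ahead."""

    def __init__(self):
        self.private_chain = []  # blocks mined but not yet published
        self.lead = 0            # private height minus public height

    def on_mined_block(self, block):
        # Keep newly mined blocks private instead of broadcasting them.
        self.private_chain.append(block)
        self.lead += 1

    def on_public_block(self):
        # An honest miner extended the public chain; publish just enough
        # of the private chain to stay ahead and orphan their block.
        self.lead -= 1
        if self.private_chain and self.lead >= 0:
            return self.private_chain.pop(0)  # release one withheld block
        # No lead left: give up and fall back to mining on the public tip.
        self.private_chain.clear()
        return None
```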
But just to give you a sense of what reinforcement learning is, in case you haven't heard of it: it's the concept where you have a piece of software, an agent, that's interacting with an environment. One of the most famous examples is from DeepMind, where they implemented a Deep Q-Network that interacts with different Atari games and learns how to solve them. So we thought, okay, we can take this concept, apply it to different protocols, and just see what happens. And to give a sense of why we chose it: it's an automated approach that has proven to beat humans and many other algorithms and strategies, so it just made sense to go that route.

So how does it work? You have a loop where an agent collects observations from the environment, which are partial information about the current state, and takes actions, which return a reward. The process continues in a loop where you keep iterating through different actions in different states.

So what are the different observations you can get in terms of our protocol? We're looking at the number of blocks mined, the number of blocks we see, the percentage of uncles, the different things that are actually happening in the network, and feeding that information into the agent so it can learn the impact of its actions. To reduce complexity, we didn't want to be too broad in terms of what a miner can do, so we decided that you could only hold up to 10 mined blocks in your own private chain, and you could publish up to three blocks in a row, so you could publish one, two, or three based on whatever strategy seems best, and you would get some reward based on that.

I think the most complicated part of setting this up is figuring out the best reward to give the agent, because you want a reward such that it discovers the optimal strategy, and discovers selfish mining, without you hard-coding the entire set of actions it needs to take. What we decided to do in our case is to incentivize staying ahead: mining as many blocks as possible, faster than the other miners. There are other reward strategies you could take; we tested a few to see what gave the best result, but eventually decided that the metric of staying ahead was probably the best.

Going back to what this loop looks like: we have an agent implemented in Python, calling code implemented in Java that runs the simulation, and it essentially goes through this loop over and over again, selecting actions and then collecting the information it gets back from the simulation. There are a number of different algorithms you can use; as I mentioned, one of the most famous is Deep Q-Networks, but there are also model-free Q-learning algorithms and policy-gradient-based algorithms. They all apply a different set of statistics and modeling, and use Markov chains as the basis of how they calculate different states and rewards. We decided that probably the simplest one to go for was Q-learning, so I'm going to explain a little bit what Q-learning essentially is. We used a Python framework developed by OpenAI, Gym, which allows you to create your own environment and an interface you can use.
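As an illustration of what that agent-environment interface looks like, here is a minimal sketch of a Gym-style environment for the action and observation spaces just described. The class and field names are hypothetical, not the project's actual code; the action space (hold, or publish one to three blocks) and the observations (blocks in the public and private chains, uncle percentage, forks) follow the talk, and the call into the Java simulator is stubbed out.

```python
import gym
import numpy as np
from gym import spaces

class SelfishMiningEnv(gym.Env):
    """Hypothetical Gym environment mirroring the setup described:
    actions are how many withheld blocks to publish (0 = keep holding,
    1-3 = publish that many); observations summarize chain state."""

    def __init__(self):
        self.action_space = spaces.Discrete(4)  # hold, publish 1, 2, or 3
        # Observation: [public-chain blocks, private-chain blocks,
        #               uncle percentage, fork count].
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(4,), dtype=np.float32)
        self.state = np.zeros(4, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(4, dtype=np.float32)
        return self.state

    def step(self, action):
        reward = self._simulate(action)
        done = self.state[0] >= 1000  # e.g. stop after 1,000 blocks mined
        return self.state, reward, done, {}

    def _simulate(self, action):
        # Placeholder: the real environment forwards `action` to the Java
        # simulator and reads back the chain state; here we just advance
        # a counter so the episode eventually terminates.
        self.state[0] += 1
        return 0.0
```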
And so we developed that, calling the Java code, and then we tested many different scenarios: we wanted to see what the best strategy was when you have 10%, 40%, and 60% of the hashing power. Then we set some hyperparameters related to these Q-tables and how you calculate the reward. You have the learning rate, which essentially tells you how much you value new information you receive versus information you've recorded in the past. You also have a discount factor, which also pertains to the reward: you look at the current reward you get for an action plus the reward for the next action, because you're trying to discover the most optimal path. It's something you have to tune in terms of how you value a future reward. You could think: okay, I can mine a block right now, publish it as soon as possible, and get a reward of two; but if I decide to wait a couple of blocks before publishing it, time goes by and the money might be worth less because of things like inflation. Or you could think: well, I mined four blocks in one minute, that delay is negligible, it's worth the same. Based on what you value most, you tune this parameter. And finally you have epsilon, which relates to randomness: how many choices you make at random versus how many choices you make to maximize your strategy by taking the biggest reward.

So there are two phases, as I mentioned before. The first is driven by epsilon, where you're just making random choices among the different actions you can take and recording the results you get in this table. This then lets you generate values so that you can converge on an optimum, deciding the best action at each state you reach. There's some theory behind this using something called the Bellman equation, which basically says that by picking the optimal action at each state you find yourself in, you will maximize reward over the entirety of your decision process. So you're looking, as I mentioned before, at the immediate reward, and then adding the value of the future reward of the next state and action, but multiplied by the discount factor; that's what I was talking about before, that zero-point-something value.

Just to make this a little more concrete, because it can be very abstract: we can think of this Q-matrix as something like this, where you have four different actions: you can hold your block and not publish it, just keeping it in the private chain, or you can publish one, two, or three blocks, the set of actions I mentioned before. And the states are how many blocks you're holding: one, two, three, four, and more, but for simplicity this example caps it at three. When you initialize it, all the values are at zero, and as you keep going randomly through the actions you update the values. So let's walk through one of these examples. Say you start at a state and decide to hold; you calculate what happens when you hold, and then maybe at the next step you want to publish, which generates a reward, say two ETH. And you continue doing this through random paths until you've generated enough values. These are just made-up values, not the actual ones, but they give you an idea of what happens.
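Here is a minimal sketch of the tabular Q-learning just described. The hyperparameter values are illustrative placeholders, not the ones used in the project; the update is the standard Bellman rule, Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)], where α is the learning rate, γ the discount factor, and ε the exploration rate.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 11, 4            # hold 0-10 blocks; hold / publish 1, 2, 3
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative values, not the talk's

Q = np.zeros((N_STATES, N_ACTIONS))    # Q-table initialized to all zeros

def choose_action(state: int) -> int:
    # Epsilon-greedy: explore with probability EPSILON, otherwise exploit
    # by taking the action with the highest recorded value at this state.
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(Q[state]))

def update(state: int, action: int, reward: float, next_state: int) -> None:
    # Bellman update: blend the old estimate with the immediate reward
    # plus the discounted value of the best action from the next state.
    best_next = np.max(Q[next_state])
    Q[state, action] += ALPHA * (reward + GAMMA * best_next - Q[state, action])
```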
And then, after a certain point, what you start doing is: okay, I find myself at a state where I'm holding two blocks; what is my next best action, should I publish them or keep holding? You take the action with the maximum value at each state, and that's how you find an optimal path. As I mentioned briefly, there's a Markov chain process backing all of this and ensuring the equation holds true. It's also a way to visualize this differently: you have a chain where you're mining, you can start your own private chain and keep switching between one and the other, and essentially, because of the properties of a Markov chain, you reach a steady state. That's why this table is useful: you can assume that after a certain number of steps you've reached the same distribution as if you had run to infinity, and that's how you can maximize the values.

There are a lot of potential challenges when you take this approach: you need to understand what the best data to feed in is, you need to test all these hyperparameters and keep changing them until you find something optimal, and there's also the question of how you build this, what behaviors you want to incentivize, how you define something as Byzantine, and so on. There are a lot of things you can run into.

So, going back to our proof-of-work example: as observations, we tracked the number of blocks published in the public chain and the private chain, keeping track of uncles and forks, and fed that to the agent so it could take the best actions. We ran this similarly to the selfish-mining simulation: nine honest miners and one Byzantine miner acting through the reinforcement learning agent. We measured it at 10, 40, and 60 percent hashing power and collected the results. As I mentioned before, we used our simulator, the OpenAI Gym framework and Python with some of the typical machine learning libraries, and we ran all of these tests on AWS so we could test and collect all the data.

So this is just an overview of what we were doing at the different hashing rates. We collected the reward over a total of one hour of mining, and also over a total of about 10,000 episodes per simulation: essentially, we let the agent run for an hour, or until it has mined a thousand blocks, record that behavior, and repeat for 10,000 episodes while it varies the actions it selects. We also measured three different strategies: we wanted to make sure we were performing better than if you were just mining honestly, or mining randomly, and we wanted to compare the results.

So what do the results look like? We found that even when you run it for just a thousand episodes, the agent incentivized to hold its chain at 10% hashing power actually performs worse, because the reality is that at 10% you're never going to be ahead of all the other miners; you don't have enough power, so you just end up losing more. However, at 40% you already start seeing an increase in your reward; after maybe 500 episodes you can already see a bit of an improvement. And at 60%, it just takes over the network.
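Putting the pieces together, here is a sketch of the outer experiment loop under the stated setup: 10%, 40%, and 60% hashing power, roughly 10,000 episodes each, with an episode ending after an hour of simulated mining or 1,000 blocks. `SelfishMiningEnv`, `choose_action`, and `update` are the hypothetical pieces sketched earlier, and the `byzantine_hash_power` keyword is an assumed parameter, not the project's actual API.

```python
EPISODES = 10_000
HASH_POWERS = [0.10, 0.40, 0.60]  # fractions of network hash rate to test

def discretize(obs) -> int:
    # Index the Q-table by the number of withheld blocks (capped at 10),
    # taken here to be the second observation component.
    return min(int(obs[1]), N_STATES - 1)

def run_experiment(env_factory):
    results = {}
    for power in HASH_POWERS:
        env = env_factory(byzantine_hash_power=power)  # hypothetical kwarg
        episode_rewards = []
        for _ in range(EPISODES):
            obs, total, done = env.reset(), 0.0, False
            while not done:  # episode ends after an hour or 1,000 blocks
                state = discretize(obs)
                action = choose_action(state)
                next_obs, reward, done, _ = env.step(action)
                update(state, action, reward, discretize(next_obs))
                obs, total = next_obs, total + reward
            episode_rewards.append(total)
        results[power] = episode_rewards
    return results
```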
So, the results: we summarized them here, and we're also looking at this metric, the ratio of blocks mined, which is how many blocks you mined versus the total number of blocks mined. What you see here is that at 10% you're doing worse than if you were just being honest. At 40%, though, you start improving; we set up two different reward systems just to test things, and with one of them the agent does better at 40% while acting selfishly, getting almost 50% of the blocks, which is a slight increase. And at 60% it's just mining all of the blocks, because you get such a huge advantage from selfish mining that you're publishing most of the blocks; nobody can catch up to you, and you're also decreasing the difficulty, so you can mine faster with the same hash rate.

So what are some of the key takeaways from all of this? At 60% you're doing a lot of damage, but we all know that with that much hashing power you'd expect to be able to do a lot of damage anyway. At 40% you start achieving some interesting results. One of our hopes was that by implementing this automated approach we would find other flaws, something different from selfish mining, a different strategy. And at 10% it just does very poorly. Like I mentioned before, at 60% you're producing over 90% of the blocks, but what's interesting is that you're not actually making any more money; if anything, you're making around the same amount of ether. You're just attacking the difficulty, and it's not necessarily beneficial.

We couldn't find a better strategy than selfish mining, but it's interesting, because what you can do is use these kinds of systems to test different protocols and understand if there's something going on. You can see it as a detection system that signals something is wrong with the protocol: if you can do better by not following the honest strategy, that's a red flag. This can help you understand and evaluate in different ways what's happening, and you shouldn't think of this as a complete solution, but rather as a good indicator of what can happen. This is an open-source project, it's all on GitHub, so if anyone feels like testing their own protocols, you're welcome to do it. And yeah, that's it.