So there are a number of simple solutions we can think of, and as you have discovered, there can be multiple approaches. One is to simply select the best move, the one with the highest Q value. An issue there is that this can be relatively unstable: the highest-Q move could be one we selected only a very small number of times. Another option, which is very conservative, is to choose the move with the highest lower bound; that would be a move with a relatively good Q value and relatively small error bars. Alternatively, we can select the most visited move, which tends to correlate with the best one. There are many other solutions in this space.

Now that we've seen this, let's put the pieces together. We want to have a policy network, a value network, and a Monte Carlo tree search procedure wrapped around them. In fact, we can think of MCTS as a policy improvement step: we start with a policy that we get from the neural network, we run MCTS, and it gives us visit counts. If we then use these visit counts in place of the original policy, we have a better policy. And of course, we can iterate that idea.

So when it comes to AlphaZero, it improves the policy by making it more similar to the normalized visit counts: pi(a|s) = N(s, a) / sum_b N(s, b). We want to make the network policy as similar as possible to this target, and in deep learning that generally means we need a cost function that measures how similar they are. So the loss here, as a function of theta, is the cross entropy between pi and p, which is just L(theta) = -sum_a pi(a|s) log p_theta(a|s). And this is now something we can optimize meaningfully. Now that we have this algorithm, I want you to look at the implementation and see how it scales.
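As a minimal sketch of the three root-action selection rules mentioned above (not code from the lecture), the snippet below assumes we already have per-action search statistics: mean values `q_values` and `visit_counts`. The particular confidence-bound formula and the helper name `select_action` are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def select_action(q_values, visit_counts, c=1.0, rule="most_visited"):
    """Pick a root action from MCTS statistics under one of three rules (sketch)."""
    q = np.asarray(q_values, dtype=float)
    n = np.asarray(visit_counts, dtype=float)
    total = n.sum()

    if rule == "max_q":
        # Highest mean value; can be unstable if that action was rarely visited.
        return int(np.argmax(q))
    if rule == "lower_bound":
        # Conservative: highest lower confidence bound, i.e. good Q with small error bars.
        # The exact bound is an assumption here; any shrinking-with-visits bound works similarly.
        lower = q - c * np.sqrt(np.log(total + 1.0) / (n + 1e-8))
        return int(np.argmax(lower))
    if rule == "most_visited":
        # Most visited action; visit counts tend to correlate with the best move.
        return int(np.argmax(n))
    raise ValueError(f"unknown rule: {rule}")

# Example with made-up statistics for four candidate moves.
print(select_action([0.1, 0.4, 0.35, -0.2], [10, 5, 200, 1], rule="most_visited"))
```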
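And here is a small sketch, under the same assumptions, of the AlphaZero-style policy target and loss described above: turn root visit counts into the normalized policy pi(a|s) = N(s, a) / sum_b N(s, b), then measure the cross entropy between pi and the network's policy output p. The function names are hypothetical; only the two formulas come from the text.

```python
import numpy as np

def visit_count_policy(visit_counts):
    """Target policy from search: pi(a|s) = N(s, a) / sum_b N(s, b)."""
    n = np.asarray(visit_counts, dtype=float)
    return n / n.sum()

def policy_cross_entropy(pi, p, eps=1e-12):
    """Loss L(theta) = -sum_a pi(a|s) * log p_theta(a|s), with clipping for stability."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return float(-np.sum(np.asarray(pi, dtype=float) * np.log(p)))

# Example: visit counts from search vs. a uniform network policy over four moves.
pi = visit_count_policy([10, 5, 200, 1])
p = np.array([0.25, 0.25, 0.25, 0.25])
print(pi, policy_cross_entropy(pi, p))
```

In a full training loop this loss would be minimized with respect to the network parameters theta, so the network policy moves toward the visit-count policy, which is exactly the iterated improvement idea discussed above.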