So, now we are moving toward Monte Carlo Tree Search (MCTS). We clearly want to plan: we want to consider potential futures, but in the end we only want to consider meaningful futures. If you consider a game like chess, there are just lots and lots of really bad ideas, and ruling those out helps us choose a good action. We want to look at the futures that are relevant for what we can do now.

There are two policies involved here. The first one is: build me a good tree, a tree that is informative about what I should do. The second one is: choose a good action based on that tree. Since the full tree is too big, let us only consider meaningful futures. What does that mean? We will assume that both players are not horrible, that they will generally play moves that are meaningful.

So, which parts of the tree do we want to explore? Do we want to explore just the best moves, which would produce a very sparse tree? Do we want to explore the moves that are most uncertain? That would mean following a lot of really bad moves. What we really want to do is follow moves that could be the best moves, and that is something we can view as an upper confidence bound (UCB) strategy. For every potential move, we have an estimate, which we will represent using a neural network, of how good it is, and ultimately also of how uncertain we are about how good it is. We want to focus our effort on the moves that could be the best moves. That is where upper confidence bound strategies come from.

Let's first talk about the tree itself. Every node in the tree is really a pair of a state and an action, an (s, a) pair. For each such pair we store the average value, Q(s, a), and the number of times we have chosen action a in state s, the count N(s, a). The total number of visits to state s, N(s), can then be calculated as the sum over all actions of N(s, a). We also store the policy prior, P(s, a).

Now, let's talk about how we build that tree. First, we select a path through the tree. Then we expand it by considering additional actions we could have taken. Then we run a simulation: we descend to a leaf of the tree and do a rollout from there. Finally, we back up, taking the result from the end and propagating it back to all the relevant state-action pairs along the path. Those are the steps we go through when we build trees for Monte Carlo tree search. I should say this slide is a wonderful example from the Sutton and Barto reinforcement learning book, which I very highly recommend.

Now, let's take the upper confidence bound idea we mentioned before and make it more precise. What we want to do is maximize

U(s, a) = Q(s, a) + c_puct · P(s, a) · √N(s) / (1 + N(s, a)).

So what do we have? The first term favors moves whose experienced quality Q(s, a) is good. The second term favors moves that we haven't tried many times.
Note that the smaller N(s, a) is, the larger that second term becomes, and the more attractive it is to try this move. The parameter c_puct, where UCT stands for upper confidence trees, regulates the trade-off between how important it is to observe new things and how important it is to follow moves that we think are good. Importantly, there is no need to play the full game. Either we encounter the end of the game and its valuation, so in the end I win, I draw, or I lose, or, if we don't get to the end of the game, we can simply use the value estimate from our value network. It won't be perfect, and it will get better over time, but it gives us an estimate. In the end, we will use deep learning to train both a policy network and a value network. So now your goal is to really understand how the tree is grown for Monte Carlo tree search.
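To make the selection rule concrete, here is a minimal Python sketch of the per-node statistics and the PUCT formula above. The names (Node, select_action, c_puct) are illustrative, not from any specific library; the priors are assumed to come from the policy network.

```python
import math

class Node:
    """Statistics for one state s; every child edge is a pair (s, a)."""
    def __init__(self, priors):
        # priors: dict mapping each action a to its prior P(s, a)
        self.P = dict(priors)              # policy prior P(s, a)
        self.Q = {a: 0.0 for a in priors}  # average value Q(s, a)
        self.N = {a: 0 for a in priors}    # visit count N(s, a)

    def visits(self):
        # N(s) is the sum over all actions of N(s, a)
        return sum(self.N.values())

def select_action(node, c_puct=1.0):
    # Maximize U(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))
    n_s = node.visits()
    def u(a):
        return node.Q[a] + c_puct * node.P[a] * math.sqrt(n_s) / (1 + node.N[a])
    return max(node.P, key=u)
```

The first summand in u(a) is the exploitation term (follow moves that look good); the second shrinks as N(s, a) grows, so rarely tried moves stay attractive, with c_puct setting the balance.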
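And here is a sketch of how one tree-building iteration (select, expand, evaluate, back up) could look, reusing the Node class and select_action from the sketch above. The helpers is_terminal, terminal_value, next_state, policy_net, and value_net are hypothetical stand-ins for the game rules and the two networks, and states are assumed to be hashable so they can serve as dictionary keys.

```python
def run_iteration(tree, root_state, c_puct=1.0):
    """One tree-building pass: select, expand, evaluate, back up."""
    state, path = root_state, []

    # 1. Select: follow the PUCT rule down through the existing tree.
    while state in tree and not is_terminal(state):
        action = select_action(tree[state], c_puct)
        path.append((state, action))
        state = next_state(state, action)

    # 2. Expand and evaluate: at a real game end, use its valuation
    #    (win / draw / loss); otherwise expand the leaf with the policy
    #    network's priors and use the value network's estimate instead
    #    of playing the game to the end.
    if is_terminal(state):
        value = terminal_value(state)
    else:
        tree[state] = Node(policy_net(state))
        value = value_net(state)

    # 3. Back up: update the count N(s, a) and the running average
    #    Q(s, a) for every state-action pair on the selected path.
    #    (In a two-player game the sign of the value would flip at
    #    each ply, since the players alternate.)
    for s, a in reversed(path):
        node = tree[s]
        node.N[a] += 1
        node.Q[a] += (value - node.Q[a]) / node.N[a]
```

Running this iteration many times from the root grows exactly the kind of tree described above: dense along lines both players might actually play, and sparse elsewhere.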