So now that we have a fully specified value network, we still don't have the values. We have the data, we spoke about the data, and we have the network, we spoke about the network. But how can we train it? It turns out that in PyTorch, once we have specified the network and the data, training is incredibly simple. Here we have the relevant function; let's just go through it a little bit together. We have train_value with self-play: this system will be training against itself. What do we need? We need an optimizer, and here we're using the Adam optimizer. We have a loss criterion, the mean squared error loss. We keep track of the losses, and we're using graphics-card acceleration, which is where you see cuda there. Then we go through the various epochs: we get the outcomes by making the AI play against itself, we prepare the training examples, and then we train the network. We set the gradients to zero, we estimate the values, we calculate the loss, and then we calculate the gradients. Look, this is almost magic: loss.backward(). We will talk a lot about that in the future. And then we do one step of optimization, and that is all we need.

Let me remind you: the optimizer is something we will talk about in week four. The upsides and downsides of the various losses are something we'll talk about in week two. We have interesting self-play here, we have data formatting and augmentation that we will also speak about later, and we have automatic differentiation, autograd, which we'll talk about in week two. If we run this, we can see that the mean squared error loss gets better over time, over the iterations. It doesn't get better smoothly; it jumps up and down a little bit, for all kinds of reasons that we will understand later in the course. These graphs of loss as a function of iteration are essential tools for us if we want to debug systems.
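The loop described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the course's actual code: the tiny network, the fake self-play data, and all the names here are stand-ins for whatever the real system produces.

```python
# Hypothetical sketch of the training loop described above. The network
# and the (state, outcome) data are toy stand-ins for real self-play data.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny value network: board encoding in, scalar value in [-1, 1] out.
net = nn.Sequential(nn.Linear(9, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())

# Fake "self-play" examples: 64 board encodings and their game outcomes.
states = torch.randn(64, 9)
outcomes = torch.sign(torch.randn(64, 1))  # results in {-1, 0, +1}

# Graphics-card acceleration: this is where "cuda" comes in.
device = "cuda" if torch.cuda.is_available() else "cpu"
net, states, outcomes = net.to(device), states.to(device), outcomes.to(device)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # the Adam optimizer
criterion = nn.MSELoss()                                 # mean squared error

losses = []
for epoch in range(50):
    optimizer.zero_grad()               # set the gradients to zero
    values = net(states)                # estimate the values
    loss = criterion(values, outcomes)  # calculate the loss
    loss.backward()                     # the "almost magic": autograd
    optimizer.step()                    # one step of optimization
    losses.append(loss.item())          # record loss for the debugging plot
```

Plotting `losses` against the iteration index gives exactly the kind of jumpy-but-decreasing curve discussed in the lecture.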
There are cases where the loss goes up. Well, then maybe we are wrong about the cost function, or we might have implemented something incorrectly. The curve might be extremely noisy, in which case we might want to worry that our step sizes are too large. There are lots of different failure modes of neural networks that we can understand by looking at these curves; they are a crucial tool in that context. So what we did is optimize the mean squared error loss. Think about the mean squared error loss. Why is this a good idea? Is it a good idea? What does it mean? Make sure you discuss it in your group.
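As a starting point for that discussion, it helps to see concretely what the mean squared error computes: the average of the squared differences between the network's predicted values and the actual game outcomes. The numbers below are made up purely for illustration.

```python
# Worked example: MSE is the mean of squared prediction errors.
import torch
import torch.nn.functional as F

pred = torch.tensor([0.5, -0.2, 0.9])    # value-network outputs (illustrative)
target = torch.tensor([1.0, 0.0, 1.0])   # actual game outcomes (illustrative)

mse_manual = ((pred - target) ** 2).mean()   # (0.25 + 0.04 + 0.01) / 3 = 0.1
mse_builtin = F.mse_loss(pred, target)       # PyTorch's built-in MSE agrees
```

Note how squaring penalizes the 0.5 miss much more heavily than the 0.1 miss; whether that weighting makes sense for game outcomes is exactly the question to discuss.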